1 Introduction and Related Work

In this paper, we present a new family of algorithms solving the single keyword string pattern matching problem. This particular pattern matching problem can be described as follows: given an input string S and a keyword p, find all occurrences of p as a contiguous substring of S. The field of string pattern matching is generally well-studied (some thought it to be exhausted by the end of the 1970’s); however, it continues to yield new and exciting algorithms, as can be seen in annual conferences such as Combinatorial Pattern Matching and Stringology. In [8] (a dissertation by the last author of this paper), a taxonomy of existing algorithms was presented, along with a number of new algorithms. Any given algorithm may have more than one possible derivation, leading to different classifications of the algorithm in a taxonomy. Many of the new derivations can prove to be more than just an educational curiosity, possibly leading to interesting new families of algorithms. This paper presents one such family, with some new algorithms and also some alternative derivations of existing ones. The algorithms presented in this paper have been extended to handle some more complex pattern matching problems, including multiple keyword pattern matching, regular pattern matching and multi-dimensional pattern matching. For some recent examples of this, see [9,10,11,12].

2 Mathematical Preliminaries

While most of the mathematical notation and definitions used in this paper are described in detail in [4], here we present some more specific notations. Indexing within strings begins at 0, as in the C and C++ programming languages. We use ranges of integers throughout the paper, defined (for integers i and j) by:

$$ [i,j) = \{\,k \mid i \le k < j\,\} $$
$$ (i,j] = \{\,k \mid i < k \le j\,\} $$
$$ [i,j] = [i,j) \cup (i,j] $$
$$ (i,j) = [i,j) \cap (i,j] $$
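
For example, with \(i = 2\) and \(j = 5\) these ranges are

$$ [2,5) = \{2,3,4\}, \quad (2,5] = \{3,4,5\}, \quad [2,5] = \{2,3,4,5\}, \quad (2,5) = \{3,4\} $$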

In addition, we define a permutation of a set of integers to be a bijective mapping of those integers onto themselves.

3 The Problem and a First Algorithm

Before giving the problem specification (in the form of a postcondition to the algorithms), we define a predicate which will make the postcondition and algorithms easier to read. Keyword p (with the restriction that \(p \ne \varepsilon \), where \(\varepsilon \) is the empty string) is said to match at position j in input string S if \(p = S_{j \cdots j+|p|-1}\); this is restated in the following predicate:

Definition 3.1

(Predicate Matches): We define predicate Matches as

$$ \textit{Matches}(S,p,j) \equiv (p = S_{j \cdots j+|p|-1}) $$

   \(\square \)

The pattern matching problem requires us to compute the set of all matches of keyword p in input string S. We register the matches as the set O (for “output”) of all indices j (in S) such that \(\textit{Matches}(S,p,j)\) holds.

Definition 3.2

(Single keyword pattern matching problem): Given a common alphabet V, input string S, and pattern keyword p, the problem is defined using postcondition PM:

$$ O = \{\,j \mid j \in [0,|S|) \mathrel {\wedge }\textit{Matches}(S,p,j)\,\} $$

Note that this postcondition implicitly depends upon S and p, even though we do not make that explicit.    \(\square \)
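
For illustration, a minimal Python sketch that establishes PM directly (by testing Matches at every index, with no attempt at efficiency) is:

```python
def matches(S: str, p: str, j: int) -> bool:
    """Predicate Matches(S, p, j): p occurs in S starting at index j."""
    return S[j:j + len(p)] == p


def naive_pm(S: str, p: str) -> set[int]:
    """Establish postcondition PM directly: O = { j in [0,|S|) | Matches(S, p, j) }."""
    assert p != ""  # the problem requires p to be non-empty
    return {j for j in range(len(S)) if matches(S, p, j)}


print(naive_pm("ababa", "aba"))  # {0, 2}
```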

We can now present a nondeterministic algorithm which keeps track of the set of possible indices (in S) at which a match might still be found (indices at which we have not yet checked for a match). This set is known as the live zone. Those indices not in the live zone are said to be in the dead zone. This gives us our first algorithm (presented in Dijkstra’s pseudocode [1, 3, 6]).

figure a

The invariant specifies that \({ live}\) and \({ dead}\) are disjoint and account for all indices in S; additionally, any match at an element of \({ dead}\) has already been registered. Thanks to this relationship between \({ live}\) and \({ dead}\), we could have written the repetition condition \({ live} \ne \varnothing \) as \({ dead} \ne [0,|S|)\), and the j selection condition \(j \in { live}\) as \(j \not \in { dead}\). It should be easy to see that the invariant and the termination condition of the repetition imply the postcondition—yielding a correct algorithm. Note that this algorithm is highly over-specified in that it keeps both variables live and dead to represent the live and dead zones, respectively. For efficiency, only one of these sets would normally be kept, as is seen in [9,10,11].
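
For illustration, a minimal Python sketch of this first algorithm (deliberately keeping both live and dead, and possibly differing in detail from figure a) is given below; the nondeterministic choice of j is simulated by popping an arbitrary element of live:

```python
def dz_nondeterministic(S: str, p: str) -> set[int]:
    """Dead-zone skeleton: live and dead partition [0,|S|); every index is moved
    to dead eventually, and matches are registered as they are encountered."""
    O: set[int] = set()
    live, dead = set(range(len(S))), set()
    # Invariant: live and dead are disjoint, live | dead == [0,|S|),
    # and O contains exactly the match positions lying in dead.
    while live:                       # equivalently: dead != [0,|S|)
        j = live.pop()                # an arbitrary (nondeterministic) j in live
        if S[j:j + len(p)] == p:      # Matches(S, p, j)
            O.add(j)
        dead.add(j)
    return O


print(dz_nondeterministic("ababa", "aba"))  # {0, 2}
```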

Some of the rightmost positions in S cannot possibly accommodate matches—no match can be found at any position \(j \in [|S| - |p| + 1, |S|)\) since \(|S_{j \cdots |S| - 1}| \le |S_{|S|-|p|+1 \cdots |S|-1}|< |p|\) (the match attempt begins too close to the end of S for p to fit). For this reason, we can safely change the initializations of \({ live}\) and \({ dead}\) to

$$ { live}, { dead} :=[0,|S|-|p|], [|S|-|p|+1,|S|) $$
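
For example, with \(|S| = 10\) and \(|p| = 3\) this initialization becomes

$$ { live}, { dead} := [0,7], [8,10) $$

so the two rightmost positions, 8 and 9, are dead from the outset.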

In the next section, we give a more deterministic (and more realistically implementable) version of the last algorithm.

4 A More Deterministic Algorithm

In the last algorithm, our comparison of p with \(S_{j \cdots j+|p|-1}\) is embedded within the evaluation of predicate Matches. In this section, we make this comparison explicit. We begin by noting that \(p = S_{j \cdots j+|p|-1}\) is equivalent to comparing the individual symbols \(p_k\) of p with the corresponding symbols \(S_{j+k}\) of S (for \(k \in [0,|p|)\)). In fact, we can consider the symbols in any order whatsoever. To determine the order in which they will be considered, we introduce match orders:

Definition 4.1

(Match order): We define a match order \({ mo}\) as a permutation on \([0,|p|)\).    \(\square \)
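
For example, for \(|p| = 4\) the identity permutation compares the symbols of p from left to right, while its reversal compares them from right to left (as in Boyer-Moore-style algorithms); any other permutation of \([0,4)\) is an equally valid match order:

```python
# Two match orders (permutations of [0,|p|)) for a keyword of length 4:
mo_forward = [0, 1, 2, 3]    # compare p[0], p[1], p[2], p[3] in turn
mo_backward = [3, 2, 1, 0]   # compare p[3] first: a Boyer-Moore-style order
```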

Using \({ mo}\), we can restate our match predicate.

Property 4.2

(Predicate Matches): Predicate Matches is restated as

$$ \textit{Matches}(S,p,j) \equiv (\forall \, i : i \in [0,|p|) : p_{{ mo}(i)} = S_{j+{ mo}(i)}) $$

   \(\square \)

This rendition of the predicate will be evaluated by a repetition which uses a new integer variable i to step from 0 to \(|p|-1\), comparing \(p_{{ mo}(i)}\) to the corresponding symbol of S. As i increases, the repetition maintains the following invariant:

$$ (\forall \, k : k \in [0,i) : p_{{ mo}(k)} = S_{j+{ mo}(k)}) $$

and terminates as early as possible, that is, at the first mismatch or as soon as \(i = |p|\).

In the following algorithm, we use the match order \({ mo}\), the new repetition and our previous optimization to the initializations of \({ dead}\) and \({ live}\).

figure b

The operator P cand Q appears in the guard of the inner loop of the above algorithm. This operator is similar to conjunction \(P \mathrel {\wedge }Q\) except that if the first conjunct evaluates to false then the second conjunct is not even evaluated. This proves to be a useful property in cases such as the loop guard since, if the first conjunct (\(i < |p|\)) is false (hence \(i \ge |p|\), and indeed \(i = |p|\)), then the term \({ mo}(i)\) appearing in the second conjunct is not even defined. Note that the implication within the second conjunct of the loop postcondition is derived from the loop guard, forcing the implication operator to be conditional as well (that is, if \(i < |p|\) is determined to be false, then \(p_{{ mo}(i)} \ne S_{j+{ mo}(i)}\) is not even evaluated).
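
For illustration, a minimal Python sketch of this algorithm (which may differ in detail from figure b) is given below; short-circuit evaluation of Python’s and operator plays the role of cand, the optimized initialization of live is used, and only live is maintained explicitly:

```python
def dz_with_match_order(S: str, p: str, mo: list[int]) -> set[int]:
    """Dead-zone matcher whose inner repetition compares the symbols of p in the
    order given by the permutation mo of [0,|p|); only live is kept explicitly."""
    O: set[int] = set()
    live = set(range(len(S) - len(p) + 1))    # optimized initialization: [0, |S|-|p|]
    while live:
        j = live.pop()                        # arbitrary choice of j from live
        i = 0
        # Short-circuiting 'and' acts as cand: mo[i] is only evaluated when i < |p|.
        while i < len(p) and p[mo[i]] == S[j + mo[i]]:
            i += 1
        if i == len(p):                       # Matches(S, p, j)
            O.add(j)
    return O


print(dz_with_match_order("ababa", "aba", [2, 1, 0]))  # {0, 2}
```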

5 Reusing Match Information

On each iteration of the outer repetition, index j is chosen and eliminated from the live zone in the statement:

$$ { live}, { dead} :={ live} \setminus \{j\}, { dead} \cup \{j\} $$

The performance of the algorithm can be improved if we remove more than just j in some of the iterations. To do this, we can use some of the match information, such as i, which indicates how far through \({ mo}\) the match attempt proceeded before finding a mismatching symbol. The information most readily available is the postcondition of the inner repetition:

$$ (\forall \, k : k \in [0,i) : p_{{ mo}(k)} = S_{j+{ mo}(k)}) \mathrel {\wedge }(i < |p| \Rightarrow p_{{ mo}(i)} \ne S_{j+{ mo}(i)}) $$

We denote this postcondition by \({ Result}(S,p,i,j)\). Since this postcondition holds, we may be able to deduce that certain indices in S cannot possibly be the site of a match. It is such indices which we could also remove from the live zone. They are formally characterized as:

$$ \{\,x \mid x \in [0,|S|) \mathrel {\wedge }(\textit{Result}(S,p,i,j) \Rightarrow \lnot \textit{Matches}(S,p,x))\,\} \qquad (1) $$

Determining this set at pattern matching time is inefficient and not easily implemented. We wish to derive a safe approximation of this set which can be precomputed, tabulated and indexed (at pattern matching time) by i. In order to precompute it, the approximation must be independent of j and S. We wish to find a strengthening of the range predicate since this will allow us to still remove a safe set of elements from set \({ live}\), thanks to the property that, if \(P \Rightarrow Q\) (P is a strengthening of Q, and Q is a weakening of P), then

$$ \{\,x \mid P(x)\,\} \subseteq \{\,x \mid Q(x)\,\} $$
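
For example, taking \(P(x) \equiv (x \bmod 4 = 0)\) and \(Q(x) \equiv (x \bmod 2 = 0)\), we have \(P \Rightarrow Q\) and indeed

$$ \{\,x \mid x \bmod 4 = 0\,\} \subseteq \{\,x \mid x \bmod 2 = 0\,\} $$

so removing only those elements which satisfy the stronger predicate is always safe.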

As a first step towards this approximation, we can normalize the ideal set (Eq. (1) above) by subtracting j from each element. The resulting characterization will be more useful for precomputation purposes:

$$ \{\,x \mid x \in [-j,|S|-j) \mathrel {\wedge }(\textit{Result}(S,p,i,j) \Rightarrow \lnot \textit{Matches}(S,p,j+x))\,\} $$

Note that this still depends upon j; however, it will make some of the derivation steps shown shortly in Sect. 5.1 easier. Because those steps are rather detailed, they are presented in isolation. Condensed, the derivation appears as:

figure c

Note that we define the predicate \(\textit{Approximation}(p,i,x)\), which depends only on p, i and x (and not on S or j) and hence can be precomputed and tabulated. It should be mentioned that this is one of several possible useful strengthenings which could be derived. We could even have used the strongest predicate, false, instead of \(\textit{Approximation}(p,i,x)\). This would yield the empty set, \(\varnothing \), to be removed from \({ live}\) in addition to j (as in the previous algorithm).

We can derive a smaller range of values of x for which we have to check whether \(\textit{Approximation}(p,i,x)\) holds. Notice that choosing an x such that \([x,|p|+x) \cap [0,|p|) = \varnothing \) has two important consequences:

  • The range of the quantification in the first conjunct of \(\textit{Approximation}(p,i,x)\) is empty (hence this conjunct is true, by the definition of universal quantification with an empty range).

  • The range condition of the second conjunct (the ‘implicator’) is false—hence the whole of the second conjunct is true since false \( \Rightarrow \) P for all predicates P.

With this choice of x, we see that predicate \(\textit{Approximation}(p,i,x)\) always evaluates to false, in which case we need not even consider values of x such that \([x,|p|+x) \cap [0,|p|) = \varnothing \). As a result, we characterize those x for which \([x,|p|+x) \cap [0,|p|) \ne \varnothing \) as follows:

figure d

Clearly we can use the restriction \(x \in [1-|p|, |p|-1]\). Intuitively (and information-theoretically), we know that there must be such a range restriction since we cannot possibly know from a current match attempt whether or not we will find a match of p in S more than \(|p|\) symbols away.
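
The restriction itself is routine interval arithmetic: the two half-open intervals overlap exactly when each begins before the other ends, that is (for integer x),

$$ [x,|p|+x) \cap [0,|p|) \ne \varnothing \;\equiv\; x < |p| \mathrel {\wedge }|p|+x > 0 \;\equiv\; x \in [1-|p|,|p|-1] $$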

Finally, we have the following algorithm (in which we have added the additional update of \({ live}\) and \({ dead}\) below the inner repetition). Note that we introduce the set \({ nogood}\) to accumulate the indices for which \(\textit{Approximation}(p,i,x)\) holds. Also note that we renormalize the set \({ nogood}\) by adding j to each of its members and restricting it to the valid range of indices, \([0,|S|)\).

figure e
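
For illustration, a minimal Python sketch of this algorithm is given below; it may differ in detail from figures c and e. Here approximation is one possible safe strengthening, namely the negation of “the symbols inspected so far are consistent with a match of p shifted by x”; precompute_nogood tabulates it, indexed by i, over the range \([1-|p|,|p|-1]\) derived above; and dz_reuse removes the renormalized nogood indices together with j:

```python
def approximation(p: str, mo: list[int], i: int, x: int) -> bool:
    """One possible Approximation(p, i, x): the negation of 'the symbols inspected
    so far are consistent with a match of p shifted by x'.  It depends only on
    p, i and x, so it can be precomputed and tabulated."""
    m = len(p)
    consistent = all(p[mo[k]] == p[mo[k] - x]
                     for k in range(i) if 0 <= mo[k] - x < m)
    if i < m and 0 <= mo[i] - x < m:
        # the mismatched symbol of the current attempt must also disagree
        consistent = consistent and p[mo[i]] != p[mo[i] - x]
    return not consistent


def precompute_nogood(p: str, mo: list[int]) -> list[set[int]]:
    """nogood_table[i] = { x in [1-|p|, |p|-1] | Approximation(p, i, x) }."""
    m = len(p)
    return [{x for x in range(1 - m, m) if approximation(p, mo, i, x)}
            for i in range(m + 1)]


def dz_reuse(S: str, p: str, mo: list[int]) -> set[int]:
    """Dead-zone matcher that removes j and all renormalized nogood indices."""
    O: set[int] = set()
    nogood_table = precompute_nogood(p, mo)
    live = set(range(len(S) - len(p) + 1))
    while live:
        j = live.pop()
        i = 0
        while i < len(p) and p[mo[i]] == S[j + mo[i]]:
            i += 1
        if i == len(p):
            O.add(j)
        # Renormalize nogood by adding j, keep only indices in [0,|S|), and remove
        # them from live (j itself was already removed by the pop above).
        live -= {j + x for x in nogood_table[i] if 0 <= j + x < len(S)}
    return O


print(dz_reuse("abababa", "aba", [2, 1, 0]))  # {0, 2, 4}
```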

5.1 Range Predicate Strengthening

Here, we present the derivation of a strengthening of the range predicate

$$ \textit{Result}(S,p,i,j) \Rightarrow \lnot \textit{Matches}(S,p,j+x) $$

Being more comfortable with weakening steps, we begin with the negation of part of the above range predicate, and proceed by weakening:

figure f
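
For illustration, one such derivation in outline (which may differ from figure f in its intermediate steps) negates the range predicate, weakens the result to a predicate mentioning only p, i and x, and then negates again:

$$\begin{aligned} & \lnot (\textit{Result}(S,p,i,j) \Rightarrow \lnot \textit{Matches}(S,p,j+x)) \\ \equiv \;& \textit{Result}(S,p,i,j) \mathrel {\wedge }\textit{Matches}(S,p,j+x) \\ \Rightarrow \;& (\forall \, k : k \in [0,i) \mathrel {\wedge }{ mo}(k)-x \in [0,|p|) : p_{{ mo}(k)} = p_{{ mo}(k)-x}) \\ &\quad \mathrel {\wedge }(i < |p| \mathrel {\wedge }{ mo}(i)-x \in [0,|p|) \Rightarrow p_{{ mo}(i)} \ne p_{{ mo}(i)-x}) \end{aligned}$$

The negation of the final predicate depends only on p, i and x and, by contraposition, implies \(\textit{Result}(S,p,i,j) \Rightarrow \lnot \textit{Matches}(S,p,j+x)\); it is therefore one valid candidate for \(\textit{Approximation}(p,i,x)\).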

6 Choosing j from the Live Zone

In this section, we discuss strategies for choosing the index j (from the live zone) at which to make a match attempt. In the last algorithm, the way in which j is chosen from set \({ live}\) is nondeterministic. This leads to the situation that \({ live}\) (and, of course, \({ dead}\)) becomes fragmented, meaning that an implementation of the algorithm would have to maintain a set of indices for live. If we can ensure that \({ live}\) is contiguous, then an implementation would only need to keep track of the (one or two) boundary points between \({ live}\) and \({ dead}\). There are several ways to do this, and we discuss some of them in the following subsections. Each of these represents a particular policy to be used in the selection of j.

6.1 Minimal Element—Towards the Classical Boyer-Moore Algorithm

We could use the policy of always taking the minimal element of \({ live}\). In that case, we can make some simplifications to the algorithm (which, in turn, improve the algorithm’s performance):

  • We need only store the minimal element of \({ live}\), instead of sets \({ live}\) and \({ dead}\). We use \(\widehat{{ live}}\) to denote the minimal element.

  • The dead zone update could be modified as follows: we will have considered all of the positions to the left of j and so we can ignore the negative elements of the update set:

    $$ \{\,x \mid x \in [1-|p|,0) \mathrel {\wedge }\textit{Approximation}(p,i,x)\,\} $$

    Indeed, we can just add to \(\widehat{{ live}}\) the maximal element of the contiguous run of update-set elements adjacent to j (all of which correspond to positions greater than j); this gives the new version of our update of \({ live}\) and \({ dead}\).

Depending upon the choice of weakening, and the choice of match order, the above policy yields variants of the classical Boyer-Moore algorithm (see [2, 7, 8]):

figure g
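
For illustration, a minimal Python sketch of this variant (which may differ in detail from figure g, and which reuses precompute_nogood from the earlier sketch) stores only the boundary \(\widehat{{ live}}\) and advances it past the contiguous block of positions ruled out by the nogood table:

```python
def dz_minimal_element(S: str, p: str, mo: list[int]) -> set[int]:
    """Boyer-Moore-like variant: always choose the minimal live index.  Only the
    boundary (the minimal live element) is stored, and it is advanced past the
    contiguous block of positions ruled out by the nogood table."""
    O: set[int] = set()
    nogood_table = precompute_nogood(p, mo)   # from the earlier sketch
    last = len(S) - len(p)                    # rightmost feasible match position
    pos = 0                                   # the minimal live element
    while pos <= last:
        i = 0
        while i < len(p) and p[mo[i]] == S[pos + mo[i]]:
            i += 1
        if i == len(p):
            O.add(pos)
        shift = 1
        while shift in nogood_table[i]:       # skip contiguously ruled-out positions
            shift += 1
        pos += shift
    return O


print(dz_minimal_element("abababa", "aba", [2, 1, 0]))  # {0, 2, 4}
```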

6.2 Recursion

We could also devise a recursive version of the algorithm as a procedure. This procedure receives a contiguous range of live indices (\({ live}\))—initially consisting of the range \([0,|S|-|p|]\).

If the set it receives is empty, the procedure immediately returns. If the set is non-empty, j is chosen so that the resulting dead zone would appear reasonably close to the middle of the current live zone. This ensures that we discard as little information as possible from the \({ nogood}\) index set. After the match attempt, the procedure recursively invokes itself twice, with the two reduced live zones on either side of the new dead zone. This yields the following procedure:

figure h

This procedure is used in the algorithm:

figure i

Naturally, for efficiency reasons, the set \({ live}\) can be represented by its minimal and maximal elements (since it is contiguous). Note that the dead zone need not be contiguous. This recursive algorithm is presented in [9] and, with benchmarking data, in [10].
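
For illustration, a minimal Python sketch of the recursive variant (which may differ in detail from figures h and i, and which again reuses precompute_nogood from the earlier sketch) is:

```python
def dz_recursive(S: str, p: str, mo: list[int]) -> set[int]:
    """Recursive dead-zone matcher: each call handles a contiguous live range
    [lo, hi] and splits it around the dead zone created by one match attempt."""
    O: set[int] = set()
    nogood_table = precompute_nogood(p, mo)   # from the earlier sketch

    def handle(lo: int, hi: int) -> None:
        if lo > hi:                           # the live range is empty
            return
        j = (lo + hi) // 2                    # place the new dead zone near the middle
        i = 0
        while i < len(p) and p[mo[i]] == S[j + mo[i]]:
            i += 1
        if i == len(p):
            O.add(j)
        # Grow the dead block [j-left, j+right] using contiguous nogood offsets.
        left = right = 0
        while -(left + 1) in nogood_table[i]:
            left += 1
        while (right + 1) in nogood_table[i]:
            right += 1
        handle(lo, j - left - 1)              # remaining live zone to the left
        handle(j + right + 1, hi)             # remaining live zone to the right

    handle(0, len(S) - len(p))                # initial live zone [0, |S|-|p|]
    return O


print(dz_recursive("abababa", "aba", [2, 1, 0]))  # {0, 2, 4}
```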

7 Conclusions

We have shown that there are still many interesting algorithms to be derived within the field of single keyword pattern matching. The correctness-preserving derivation of an entirely new family of such algorithms demonstrates the use of formal methods and the use of predicates, invariants, postconditions and preconditions. It is unlikely that such a family of algorithms could have been devised without the use of formal methods.