
1 Introduction

Finite automata are used in many applications in software engineering, including software verification [8] and text processing [3]. Despite their many applications, finite automata suffer from a major drawback: in their most common forms they can only handle finite and small alphabets. Symbolic automata allow transitions to carry predicates over a specified alphabet theory, such as linear arithmetic, and therefore extend finite automata to operate over infinite alphabets, such as the set of rational numbers [13]. Symbolic automata are therefore more general and succinct than their finite-alphabet counterparts. Traditional algorithms for finite automata do not always generalize to the symbolic setting, making the design of algorithms for symbolic automata challenging. A notable example appears in [11]: while allowing finite state automata transitions to read multiple adjacent inputs does not add expressiveness, in the symbolic case this extension makes problems such as checking equivalence undecidable.

Symbolic finite automata (s-FA) are closed under Boolean operations and enjoy decidable equivalence if the alphabet theory forms a decidable Boolean algebra [13]. s-FAs have been used in combination with symbolic transducers to analyze complex string and list-manipulating programs [12, 16]. In these applications it is crucial to keep the automata “small” and, in our previous work, we proposed algorithms for minimizing deterministic s-FAs  [13]. However, no algorithms have been proposed to reduce the state space of nondeterministic s-FAs (s-NFAs). While computing minimal nondeterministic automata is a hard problem [18], several techniques have been proposed to produce “small enough” automata. These algorithms compute bisimulations over the state space and use them to collapse bisimilar states [2, 26]. In this paper, we study the problem of computing forward bisimulations for s-NFAs.

While the problem of computing forward bisimulations has been studied for classic NFAs, it is not easy to adapt these algorithms to s-NFAs. Most efficient automata algorithms view the size of the alphabet as a constant and use data structures that are optimized for this view [2]. We propose three new algorithms for computing forward bisimulations of s-NFAs. First, we extend the classic Moore’s algorithm for minimizing deterministic finite automata [25] and define an algorithm that operates in quadratic time. We then adapt our previous algorithm for minimizing deterministic s-FAs [13] to the problem of computing forward bisimulations and show that a natural implementation leads to a quadratic-time algorithm. Finally, we adapt a technique proposed by Abdulla et al. [2] to our setting, and propose a new symbolic data structure that allows us to perform only a number of iterations that is linearithmic in the number of states and transitions. However, this improved state complexity comes at the cost of an exponential complexity in the number of distinct predicates appearing in the automaton. We compare the performance of the three algorithms on 3,625 s-FAs obtained from regular expressions and NFAs appearing in verification applications and show that, unlike for the case of deterministic s-FAs, no algorithm strictly outperforms the others.

Contributions. In summary, our contributions are:

  • a formal study of the notion of forward bisimulations for s-FAs and their relation to state reduction for nondeterministic s-FAs (Sect. 3);

  • three algorithms for computing forward bisimulations (Sects. 4, 5 and 6);

  • an implementation and a comprehensive evaluation of the algorithms on 3,625 s-FAs obtained from real-world applications (Sect. 7).

2 Effective Boolean Algebras and s-NFAs

We define the notion of effective Boolean algebra and symbolic finite automata. An effective Boolean algebra \({\mathcal {A}}\) has components \((U_{\!\!\,\,},\varPsi _{\!\!\,\,},[\![\_]\!]_{},\bot ,\top ,\vee ,\wedge ,\lnot )\). \(U_{\!\!\,\,}\) is a set called the universe. \(\varPsi _{\!\!\,\,}\) is a set of predicates closed under the Boolean connectives and \(\bot ,\top \in \varPsi _{\!\!\,\,}\). The denotation function \([\![\_]\!]_{}:\varPsi _{\!\!\,\,}\rightarrow 2^{U_{\!\!\,\,}}\) is such that, \([\![\bot ]\!]_{} = \emptyset \), \([\![\top ]\!]_{} = U_{\!\!\,\,}\), for all \(\varphi ,\psi \in \varPsi _{\!\!\,\,}\), \([\![\varphi \vee \psi ]\!]_{} = [\![\varphi ]\!]_{}\cup [\![\psi ]\!]_{}\), \([\![\varphi \wedge \psi ]\!]_{} = [\![\varphi ]\!]_{}\cap [\![\psi ]\!]_{}\), and \([\![\lnot \varphi ]\!]_{} = U_{\!\!\,\,}\setminus [\![\varphi ]\!]_{}\). For \(\varphi \in \varPsi _{\!\!\,\,}\), we write \(\mathbf {SAT}(\varphi )\) when \([\![\varphi ]\!]_{}\ne \emptyset \) and say that \(\varphi \) is satisfiable. \({\mathcal {A}}\) is decidable if \(\mathbf {SAT}\) is decidable.

Intuitively, such an algebra is represented programmatically as an API with corresponding methods implementing the Boolean operations and the denotation function. We are primarily going to use the following two effective Boolean algebras in the examples, but the techniques in the paper are fully generic.
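As a concrete illustration, such an API might be sketched as follows in Python, with predicates represented extensionally as frozensets over a small finite universe so that the denotation function is trivial and \(\mathbf {SAT}\) is a nonemptiness check. The class and method names (FinSetAlgebra, denote, sat) are illustrative, not from the paper.

```python
from dataclasses import dataclass

# Minimal sketch of an effective Boolean algebra API. Predicates are
# represented extensionally as frozensets over a small finite universe,
# so the denotation function is the identity and SAT is a nonemptiness
# check. All names here are illustrative.
@dataclass(frozen=True)
class FinSetAlgebra:
    universe: frozenset

    @property
    def bot(self):
        return frozenset()              # [[bot]] = empty set

    @property
    def top(self):
        return self.universe            # [[top]] = U

    def denote(self, phi):
        return phi                      # predicates are their denotations

    def disj(self, phi, psi):
        return phi | psi                # [[phi v psi]] = [[phi]] ∪ [[psi]]

    def conj(self, phi, psi):
        return phi & psi                # [[phi ∧ psi]] = [[phi]] ∩ [[psi]]

    def neg(self, phi):
        return self.universe - phi      # [[¬phi]] = U \ [[phi]]

    def sat(self, phi):
        return bool(phi)                # SAT(phi) iff [[phi]] is nonempty
```

For instance, with `A = FinSetAlgebra(frozenset(range(8)))`, the check `A.sat(A.conj(phi, A.neg(phi)))` is `False` for every predicate `phi`, reflecting the algebra laws above.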

 

\({\mathbf {2}}^{\mathbf {bv}k}\) :

is the powerset algebra whose domain is the finite set \(\textsc {bv}k\), for some \(k>0\), consisting of all non-negative integers smaller than \(2^{k}\)—i.e., all k-bit bit-vectors. A predicate is represented by a Binary Decision Diagram (BDD) of depth k. Boolean operations correspond directly to BDD operations and \(\bot \) is the BDD representing the empty set. The denotation \([\![\beta ]\!]_{}\) of a BDD \(\beta \) is the set of all integers n such that a binary representation of n corresponds to a solution of \(\beta \).

\(\mathbf {int}{[\mathrm {k}]}\) :

is an algebra for small finite alphabets of the form \(\varSigma =\{0,\ldots , 32k-1\}\). A predicate \(\varphi \) is an array of k unsigned 32-bit integers, \(\varphi =[a_1,\ldots ,a_k]\), and for all \(i\in \varSigma \): \(i\in [\![\varphi ]\!]_{}\) iff in the integer \(a_{i/32 + 1}\) the bit in position \(i~\mathrm {mod}~32\) is 1. Boolean operations can be performed efficiently using bit-vector operations. For example, the conjunction \([a_1,\ldots ,a_k]\wedge [b_1,\ldots ,b_k]\) corresponds to \( [a_1 \& b_1,\ldots ,a_k \& b_k]\), where & is the bit-wise and of two integers.
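A sketch of this representation in Python, where ordinary Python integers (masked to 32 bits) stand in for unsigned 32-bit words; the 0-indexed array layout and all helper names are illustrative.

```python
# Sketch of the int[k] algebra: a predicate over Sigma = {0,...,32k-1}
# is an array of k 32-bit words; bit (i mod 32) of word a[i // 32]
# records whether character i is in the denotation. Python ints are
# masked to 32 bits to mimic unsigned words; names are illustrative.
K = 2                      # alphabet {0, ..., 63}
MASK = 0xFFFFFFFF          # keep words to 32 bits

def singleton(i):
    """Predicate denoting exactly {i}."""
    words = [0] * K
    words[i // 32] = (1 << (i % 32)) & MASK
    return words

def member(i, phi):
    return (phi[i // 32] >> (i % 32)) & 1 == 1

def conj(phi, psi):
    return [a & b for a, b in zip(phi, psi)]      # bit-wise and

def disj(phi, psi):
    return [(a | b) & MASK for a, b in zip(phi, psi)]

def neg(phi):
    return [(~a) & MASK for a in phi]

def sat(phi):
    return any(a != 0 for a in phi)               # some bit is set
```

Conjunction, disjunction, and negation are single passes of word-level bit operations, which is what makes this algebra efficient for small alphabets.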

 

We can now define symbolic finite automata. Intuitively, a symbolic finite automaton is a finite automaton over a symbolic alphabet, where edge labels are replaced by predicates. In order to preserve the classical Boolean closure operations (intersection, complement, and union) over languages, the predicates must also form a Boolean algebra. Since the core topic of the paper is nondeterministic automata, we adopt the convention often used in studies of NFAs [10, 22, 28] that an automaton has a set of initial states rather than a single initial state as used in other literature on automata theory [21].

Definition 1

A symbolic nondeterministic finite automaton (s-NFA) M is a tuple \(({\mathcal {A}},Q,I,F,\varDelta )\) where \({\mathcal {A}}\) is an effective Boolean algebra, called the alphabet, Q is a finite set of states, \(I\subseteq Q\) is the set of initial states, \(F\subseteq Q\) is the set of final states, and \(\varDelta \subseteq Q\times \varPsi _{\!\!\mathcal {A}}\times Q\) is a finite set of moves or transitions.

Elements of \(U_{\!\!\mathcal {A}}\) are called characters and finite sequences of characters, elements of \(U_{\!\!\mathcal {A}}^*\), are called words; \(\epsilon \) denotes the empty word. A move \(\rho =(p,\varphi ,q)\in \varDelta \) is also denoted by \(p\xrightarrow {\varphi }_M q\) (or \(p\xrightarrow {\varphi } q\) when M is clear from the context), where p is the source state, q is the target state, and \(\varphi \) is the guard or predicate of the move. Given a character \(a\in U_{\!\!\mathcal {A}}\), an a-move of M is a tuple \((p,a,q)\) such that \(p\xrightarrow {\varphi }_M q\) and \(a\in [\![\varphi ]\!]_{}\), also denoted \(p\xrightarrow {a}_M q\) (or \(p\xrightarrow {a} q\) when M is clear). In the following let \(M=({\mathcal {A}},Q,I,F,\varDelta )\) be an s-NFA.
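Definition 1 and the notion of a-move can be rendered directly; in the following Python sketch, predicates are modeled as plain callables from characters to booleans, and the names SNFA, a_moves, and accepts are illustrative, not from the paper.

```python
from typing import NamedTuple

# Illustrative rendering of Definition 1 and of a-moves: predicates are
# modeled as Python callables (character -> bool); names are not from
# the paper.
class SNFA(NamedTuple):
    states: frozenset
    initial: frozenset
    final: frozenset
    moves: list          # (source, guard, target) triples

def a_moves(M, p, a):
    """Targets q of a-moves: p --phi--> q with a in [[phi]]."""
    return {q for (src, phi, q) in M.moves if src == p and phi(a)}

def accepts(M, word):
    """Word membership by forward exploration along a-moves."""
    current = set(M.initial)
    for a in word:
        current = {q for p in current for q in a_moves(M, p, a)}
    return bool(current & M.final)
```

For example, an s-NFA over the integers with a single move guarded by "is even" accepts exactly the nonempty words of even numbers reachable along its moves.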

Definition 2

Given a state \(p\in Q\), the (right) language of p in M, denoted \({\mathscr {L}}(p,M)\), is the set of all \(w=[a_i]_{i=1}^k\in U_{\!\!\mathcal {A}}^*\) such that, either \(w=\epsilon \) and \(p\in F\), or \(w\ne \epsilon \) and there exist \(p_{i-1}\xrightarrow {a_{i}}_M p_{i}\) for \(1\le i\le k\), such that \(p_0=p\), and \(p_k\in F\). The language of M is \({\mathscr {L}}_{}(M)\mathop {=}\limits ^{\mathrm {\tiny def}}\bigcup _{q\in I}{\mathscr {L}}(q,M)\). Two states p and q of M are indistinguishable if \({\mathscr {L}}(p,M) = {\mathscr {L}}(q,M)\). Two s-NFAs M and N are equivalent if \({\mathscr {L}}_{}(M)={\mathscr {L}}_{}(N)\).

The following terminology is used to characterize various key properties of M. A state \(p\in Q\) is called complete if for all \(a\in U_{\!\!\mathcal {A}}\) there exists an a-move from p, p is partial otherwise. A move is feasible if its guard is satisfiable.

  • M is deterministic: \(|I|=1\) and whenever \(p\xrightarrow {a} q\) and \(p\xrightarrow {a} q'\) then \(q=q'\).

  • M is complete: all states of M are complete; M is partial, otherwise.

  • M is clean: all moves of M are feasible.

  • M is normalized: for all \((p,\varphi ,q),(p,\psi ,q)\in \varDelta \): \(\varphi = \psi \).

  • M is minimal: there exists no equivalent s-NFA with fewer states.

In the following, we always assume that M is clean. If E is an equivalence relation over Q, then, for \(q\in Q\), \(q_{{/}E}\) denotes the E-equivalence class containing q and, for \(X\subseteq Q\), \(X_{{/}E}\) denotes \(\{q_{{/}E}\mid q \in X\}\). The E-quotient of M is the s-NFA

$$\begin{aligned} M_{{/}E}&\mathop {=}\limits ^{\mathrm {\tiny def}}\,\, ({\mathcal {A}},Q_{{/}E},I_{{/}E}, F_{{/}E}, \{(p_{{/}E},\varphi , q_{{/}E}) \mid (p,\varphi ,q)\in \varDelta \}) \end{aligned}$$
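The quotient construction above can be rendered directly in Python; the equivalence E is passed as a function cls mapping each state to a hashable representative of its E-class, and all names are illustrative.

```python
# Sketch of the E-quotient: states are collapsed to their E-classes and
# each transition (p, phi, q) is re-targeted to classes. cls maps a
# state to a canonical (hashable) representative of its E-class; names
# are illustrative.
def quotient(states, initial, final, moves, cls):
    q_states = frozenset(cls(q) for q in states)
    q_initial = frozenset(cls(q) for q in initial)
    q_final = frozenset(cls(q) for q in final)
    q_moves = {(cls(p), phi, cls(q)) for (p, phi, q) in moves}
    return q_states, q_initial, q_final, q_moves
```

Note that distinct transitions into states of the same class collapse into one transition of the quotient, which is where the state (and transition) reduction comes from.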

3 Forward Bisimulations

Here we adapt the notion of forward bisimulation to s-NFAs. Below, consider a fixed s-NFA \(M =({\mathcal {A}},Q,I,F,\varDelta )\).

Definition 3

Let \(E\subseteq Q\times Q\) be an equivalence relation. E is a forward bisimulation on M when, for all \((p,q)\in E\), if \(p\in F\) then \(q\in F\), and, for all \(a\in U_{\!\!\mathcal {A}}\) and \(p'\in Q\), if \(p\xrightarrow {a} p'\) then there exists \(q'\in p'_{{/}E}\) such that \(q\xrightarrow {a} q'\).

If E is a forward bisimulation on M then the quotient \(M_{{/}E}\) preserves the language of every state of M, as stated formally in Theorem 1; this generalizes the corresponding property known in the classical case of a finite alphabet.

Theorem 1

Let E be a forward bisimulation on M. Then, for all states q of M, \({\mathscr {L}}(q,M) = {\mathscr {L}}(q_{{/}E},M_{{/}E})\).

Proof

We prove the statement \(\phi (w)\) by induction over |w| for \(w\in U_{\!\!\mathcal {A}}^*\):

$$\begin{aligned} \phi (w) : \forall p\in Q_M(w \in {\mathscr {L}}(p,M) \Leftrightarrow w\in {\mathscr {L}}(p_{{/}E},M_{{/}E})) \end{aligned}$$

The base case \(|w|=0\) follows from the property of the forward bisimulation E on M that if \(p\in F\) then \(p_{{/}E}\subseteq F\), together with the definition of the E-quotient of M, whose set of final states is \(F_{{/}E}\).

For the induction case assume that \(\phi (w)\) holds as the IH. Let \(a\in U_{\!\!\mathcal {A}}\). We prove \(\phi (a\cdot w)\). Fix \(p\in Q_M\).

$$\begin{aligned} \begin{array}{lrcl} &{} a\cdot w \in {\mathscr {L}}(p,M) &{} \Leftrightarrow &{} \exists q\in Q\;\text {such that}\;(p\xrightarrow {a}_Mq,\; w \in {\mathscr {L}}(q,M)) \\ &{} &{} \mathop {\Leftrightarrow }\limits ^{\text {by IH}} &{} \exists q\in Q\;\text {such that}\;(p\xrightarrow {a}_Mq,\; w \in {\mathscr {L}}(q_{{/}E},M_{{/}E})) \\ &{} &{} \mathop {\Leftrightarrow }\limits ^{(*)} &{} \exists q\in Q\;\text {such that}\;(p_{{/}E}\xrightarrow {a}_{M_{{/}E}}q_{{/}E},\; w \in {\mathscr {L}}(q_{{/}E},M_{{/}E})) \\ &{} &{} \Leftrightarrow &{} a\cdot w \in {\mathscr {L}}(p_{{/}E},M_{{/}E}) \end{array} \end{aligned}$$

Proof of \((*)\):

\((\Rightarrow )\): If \(p\xrightarrow {a}_Mq\) then there is \((p,\varphi ,q)\in \varDelta _M\) such that \(a\in [\![\varphi ]\!]_{}\). By definition of \(M_{{/}E}\), there is \((p_{{/}E},\varphi ,q_{{/}E})\in \varDelta _{M_{{/}E}}\), hence \(p_{{/}E}\xrightarrow {a}_{M_{{/}E}}q_{{/}E}\).

\((\Leftarrow )\): Fix a q such that \(p_{{/}E}\xrightarrow {a}_{M_{{/}E}}q_{{/}E}\) and \(w \in {\mathscr {L}}(q_{{/}E},M_{{/}E})\). By definition of \(\varDelta _{M_{{/}E}}\) there exists a transition \((p_1,\alpha ,q_1)\) in \(\varDelta _M\) where \(a\in [\![\alpha ]\!]_{}\) and \({p_1}_{{/}E}= p_{{/}E}\) and \({q_1}_{{/}E}= q_{{/}E}\), so \(p_1\xrightarrow {a}_Mq_1\). By the assumption that E is a bisimulation on M it follows that there exists \(q'\in {q_1}_{{/}E}\) such that \(p\xrightarrow {a}_M q'\). But \({q_1}_{{/}E} = q_{{/}E}\), so \(q'_{{/}E} = q_{{/}E}\) and therefore \(\exists q'\in Q\;\text {such that}\;(p\xrightarrow {a}_M q',\; w \in {\mathscr {L}}(q'_{{/}E},M_{{/}E}))\). \(\boxtimes \)

Corollary 1

Let E be a forward bisimulation on M. Then \({\mathscr {L}}_{}(M) = {\mathscr {L}}_{}(M_{{/}E})\).

For a deterministic s-NFA M one can efficiently compute the coarsest forward bisimulation relation \(\equiv _M\) over \(Q_M\) defined by indistinguishability of states, in order to construct \(M_{{/}\equiv _M}\) as the minimal canonical (up to equivalence of predicates) deterministic s-NFA that is equivalent to M [13, Theorem 2]. The nondeterministic case is much more difficult because there exists, in general, no canonical minimal NFA [22] for a given regular language.

Our aim in this paper is to study algorithms for computing forward bisimulations for s-NFAs. Once a forward bisimulation E has been computed for an s-NFA M, it can be applied, according to Corollary 1, to build the equivalent E-quotient \(M_{{/}E}\) with a reduced number of states; \(M_{{/}E}\) need not be minimal, though.

4 Symbolic Partition Refinement

We start by presenting the high-level idea of symbolic partition refinement for forward bisimulations as an abstract algorithm. Let the given s-NFA be \(M =({\mathcal {A}},Q,I,F,\varDelta )\). It is convenient to view \(\varDelta \), without loss of generality, as a function from \(Q\times Q\) to \(\varPsi _{\!\!{\mathcal {A}}}\), and we also lift the definition over its second argument to subsets \(S\subseteq Q\) of states,

$$\begin{aligned} \varDelta (p,S) \mathop {=}\limits ^{\mathrm {\tiny def}}\bigvee _{q\in S}\varDelta (p,q), \end{aligned}$$

where the predicates are effectively constructed using \(\vee _{\mathcal {A}}\). Essentially, this view of \(\varDelta \) corresponds to M being normalized: all pairs \((p,q)\) such that there is no transition from p to q have \(\varDelta (p,q)=\bigvee \emptyset \mathop {=}\limits ^{\mathrm {\tiny def}}\bot \), and otherwise \(\varDelta (p,q)\) is the guard of the transition from p to q. The predicate \(\varDelta (p,S)\) denotes the set of all characters that allow a transition from p to some state in S.

M is assumed to be nontrivial, so that both F and \(Q{\setminus } F\) are nonempty. We construct partitions \({\mathcal {P}}_i\) of Q such that \({\mathcal {P}}_i\) is a refinement of \({\mathcal {P}}_{i-1}\) for \(i\ge 1\), i.e., each block in \({\mathcal {P}}_i\) is a subset of some block in \({\mathcal {P}}_{i-1}\). Initially let

$$\begin{aligned} {\mathcal {P}}_0 = \{Q\},\;{\mathcal {P}}_1 = \{F, Q{\setminus } F\}. \end{aligned}$$

For a partition \({\mathcal {P}}\) of Q define \(\sim _{{\mathcal {P}}}\) as the following equivalence relation over Q:

$$\begin{aligned} p \sim _{{\mathcal {P}}} q\mathop {=}\limits ^{\mathrm {\tiny def}}\exists B\in {\mathcal {P}}\;\text {such that}\;(p,q\in B). \end{aligned}$$

Let \({\sim _{i}}\mathop {=}\limits ^{\mathrm {\tiny def}}{\sim _{{\mathcal {P}}_i}}\). The partition \({\mathcal {P}}_i\) is refined until \({\mathcal {P}}_{n+1} = {\mathcal {P}}_n\) for some \(n\ge 1\). Each such refinement step maintains the invariant (1) for \(i\ge 1\) and \(p,q\in Q\):

$$\begin{aligned} p \sim _{{i+1}} q \Longleftrightarrow p \sim _{i} q\,\,\text {and for all } B \in {\mathcal {P}}_i: [\![\varDelta (p,B)]\!]_{}=[\![\varDelta (q,B)]\!]_{} \end{aligned}$$
(1)

Under the assumption that \({\mathcal {A}}\) is decidable, \([\![\varDelta (p,B)]\!]_{}=[\![\varDelta (q,B)]\!]_{}\) can be decided by checking that \(\varDelta (p,B)\nLeftrightarrow \varDelta (q,B)\) is unsatisfiable. So \({\mathcal {P}}_{i+1}\) can be computed effectively from \({\mathcal {P}}_i\) and iterating this step provides an abstract algorithm for computing the fixpoint \({\sim _M} \mathop {=}\limits ^{\mathrm {\tiny def}}{\sim _{{\mathcal {P}}_n}}\) such that \({\mathcal {P}}_{n+1}= {\mathcal {P}}_n\).
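Over a small finite universe, where the denotations \([\![\varDelta (p,B)]\!]_{}\) can be enumerated explicitly, one refinement step of Eq. (1) amounts to grouping the states of each block by a signature. A Python sketch of the resulting fixpoint computation (delta(p, q) gives the set of characters on the move from p to q; all names are illustrative):

```python
# Abstract partition refinement per Eq. (1), over a small finite
# universe so that denotations [[Delta(p,B)]] can be enumerated
# explicitly. delta(p, q) returns the set of characters labeling the
# move p -> q (empty set if none); names are illustrative.
def coarsest_bisim(states, final, delta):
    partition = [frozenset(final), frozenset(states - final)]
    partition = [b for b in partition if b]           # drop empty blocks
    while True:
        def signature(p):
            # the set of characters reaching each current block from p
            return tuple(frozenset(a for q in block for a in delta(p, q))
                         for block in partition)
        refined = {}
        for block in partition:
            for p in block:
                refined.setdefault((block, signature(p)), set()).add(p)
        new_partition = [frozenset(b) for b in refined.values()]
        if len(new_partition) == len(partition):      # fixpoint reached
            return new_partition
        partition = new_partition
```

Keying on the pair (block, signature) ensures each step is a refinement: states in different blocks are never merged, exactly as required by Eq. (1).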

Theorem 2

\(\sim _M\) is the coarsest forward bisimulation on M.

Proof

Let \({\sim } = {\sim _M}\). We show first that \(\sim \) is a forward bisimulation on M by way of contradiction. Suppose that \(\sim \) is not a forward bisimulation on M. Since \(p\sim _1 q\) iff \(p,q\in F\) or \(p,q\notin F\), and \(\sim \) refines \(\sim _1\), the condition that for \(p\sim q\) if \(p\in F\) then \(q\in F\) holds. Therefore, there must exist \(p\sim q\) such that for some \(a\in U_{\!\!\mathcal {A}}\) and \(p'\in Q\) we have \(p\xrightarrow {a}p'\), while for all \(q'\) such that \(q\xrightarrow {a}q'\) we have \(q'\not \sim p'\). Hence there is \(B\in {\mathcal {P}}_i\) for some \(i\ge 1\), namely \(B = p'_{{/}\sim }\), such that \(a\in [\![\varDelta (p,B)]\!]_{}\) but \(a\notin [\![\varDelta (q,B)]\!]_{}\), so \([\![\varDelta (p,B)]\!]_{}\ne [\![\varDelta (q,B)]\!]_{}\). But then \(p\not \sim _{i+1} q\), contradicting that \(p\sim q\). So \(\sim \) is a forward bisimulation on M.

Next, consider any bisimulation \(\simeq \) on M. We show that \({\simeq }\subseteq {\sim _i}\) for all \(i\ge 1\).

Base case. Suppose \(p\simeq q\). If \(p\in F\) then \(q\in F\), by Definition 3, and, since \(\simeq \) is an equivalence relation, symmetrically, if \(p\notin F\) then \(q\notin F\). So \(p\sim _{1} q\).

Induction case. Assume as the IH that \({\simeq }\subseteq {\sim _i}\). We prove that \({\simeq }\subseteq {\sim _{i+1}}\). Suppose \(p\simeq q\). We show that \(p\sim _{i+1}q\). By using the IH, we have that \(p\sim _i q\). By using Eq. (1), we need to show that for all \(B\in {\mathcal {P}}_i\), \([\![\varDelta (p,B)]\!]_{}=[\![\varDelta (q,B)]\!]_{}\). By way of contradiction, suppose there exists \(B\in {\mathcal {P}}_i\) such that \([\![\varDelta (p,B)]\!]_{}\ne [\![\varDelta (q,B)]\!]_{}\). Then, w.l.o.g., there exists \(a\in U_{\!\!\mathcal {A}}\) and \(p'\in B\) such that \(p\xrightarrow {a}p'\), and for all \(q'\in Q\) if \(q\xrightarrow {a}q'\) then \(q'\notin B\), i.e., \(q'\not \sim _{i}p'\), and by using the contrapositive of the IH (\({\not \sim _{i}}\subseteq {\not \simeq }\)) we have \(q'\not \simeq p'\). But then \(p\xrightarrow {a}p'\) while there is no \(q'\in p'_{{/}\simeq }\) such that \(q\xrightarrow {a}q'\), contradicting, by Definition 3, that \(p\simeq q\). Thus, for all \(B\in {\mathcal {P}}_i\), \([\![\varDelta (p,B)]\!]_{}=[\![\varDelta (q,B)]\!]_{}\). So \(p\sim _{i+1}q\).

It follows that \({\simeq }\subseteq {\sim }\) which proves that \({\sim }\) is coarsest. \(\boxtimes \)

A simple algorithm for computing \(\sim _M\) is shown in Fig. 1. It differs from the abstract algorithm in that the partition is refined in smaller increments, rather than in large parallel refinement steps corresponding to Eq. (1). The order of such steps does not matter as long as progress is made at each step.

Fig. 1. Simple and greedy algorithms for computing \(\sim _M\).
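Over a small finite universe with enumerable denotations, the workset-based refinement just described can be sketched as follows; this is a simplified rendition consistent with the proof outline below, not the paper's pseudocode, and all names are illustrative (delta(p, q) returns the set of characters on the move from p to q).

```python
# Workset-based sketch of the simple algorithm: repeatedly pick a
# splitter block R; any block whose states have different character
# sets into R is split, and all new subblocks are pushed back onto the
# workset W. Finite universe; names are illustrative.
def simple_bisim(states, final, delta):
    P = [frozenset(b) for b in (final, states - final) if b]
    W = list(P)                                   # workset of splitters
    while W:
        R = W.pop()
        for B in list(P):
            # group B by the set of characters that reach R
            groups = {}
            for p in B:
                key = frozenset(a for q in R for a in delta(p, q))
                groups.setdefault(key, set()).add(p)
            if len(groups) > 1:                   # B must be split
                P.remove(B)
                subblocks = [frozenset(g) for g in groups.values()]
                P.extend(subblocks)
                W.extend(subblocks)               # all subblocks re-enter W
    return P
```

Pushing every subblock back onto W is what the correctness argument below relies on: a split missed against a stale block B is recovered later against one of its subblocks.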

Theorem 3

\( SimpleBisimSFA (M)\) computes \(\sim _M\).

Proof

(outline). The key observation is the following: if \([\![\varDelta (q,B)]\!]_{} \ne [\![\varDelta (r,B)]\!]_{}\) holds for some \(q\sim _{\mathcal {P}}r\) and \(B\in {\mathcal {P}}\) and B has been split into \(\{B_i\}_{i=1}^n\) before it has been chosen from the workset then \([\![\varDelta (q,B_i)]\!]_{} \ne [\![\varDelta (r,B_i)]\!]_{}\) for some i, or else \([\![\varDelta (q,B)]\!]_{} = \bigcup _i[\![\varDelta (q,B_i)]\!]_{} = \bigcup _i[\![\varDelta (r,B_i)]\!]_{} = [\![\varDelta (r,B)]\!]_{}\). In other words, even if B has not yet been used as a splitter, the fact that \(q\not \sim _M r\) holds will be detected at some later point using one of the blocks \(B_i\) because all subblocks are added to the workset W.

The splitting of B into D and \(B{\setminus } D\) requires some explanation. First note that \(q\in D\) and \(r \in B{\setminus } D\), so both new blocks are nonempty. Second, pick any \(p\in D\) and any \(s\in B{\setminus } D\). We need to show that \([\![\varDelta (p,R)]\!]_{}\ne [\![\varDelta (s,R)]\!]_{}\) to justify the split. We know that \(\mathbf {SAT}(\varDelta (p,R)\wedge \varDelta (q,R)\wedge \lnot \varDelta (r,R))\) holds. Thus, if \(\varDelta (p,R)\) were equivalent to \(\varDelta (s,R)\) then \(\mathbf {SAT}(\varDelta (s,R)\wedge \varDelta (q,R)\wedge \lnot \varDelta (r,R))\) would also hold, contradicting that \(s\notin D\).

It follows that upon termination, when \(W=\emptyset \), \({\mathcal {P}}\) cannot be refined further and thus \({\sim _{\mathcal {P}}} = {\sim _M}\). \(\boxtimes \)

Complexity. If the complexity of checking satisfiability of predicates of size \(\ell \) is \(f(\ell )\), then \( SimpleBisimSFA (M)\) has complexity \({\mathcal {O}}(mn f(n\ell ))\), where m is the number of transitions in the input s-FA, n is the number of states, and \(\ell \) is the size of the largest predicate in the input s-FA. Since we check satisfiability by taking the union of all predicates in multiple transitions (e.g., \(\varDelta (q,R)\)), satisfiability checks are performed on predicates of size \({\mathcal {O}}(n\ell )\).

5 Greedy Symbolic Partition Refinement

We can improve the simple algorithm by incorporating Hopcroft’s “keep the smaller half” partition refinement strategy [19]. This strategy is also reused in Paige-Tarjan’s relational coarsest partition algorithm [26]. Hopcroft’s strategy is generalized to symbolic alphabets in [13] by incorporating the idea of using symmetric differences of character predicates during partition refinement, instead of single characters, as illustrated also in the simple algorithm. Here we further generalize the algorithm from [13] to s-NFAs. The algorithm can also be seen as a generalization of Paige-Tarjan’s relational coarsest partition algorithm from computing the coarsest forward bisimulation of an NFA to that of an s-NFA.

The greedy algorithm is shown in Fig. 1. The computation of partition \({\mathcal {P}}\) is altered in such a way that whenever a block B (that is no longer, or never was, in the workset W) is split into D and \(B{\setminus }D\), only the smaller of the two halves is added to the workset. In order to preserve correctness, the original \(\mathbf {SAT}\) condition involving R must be augmented with a corresponding condition involving \(R'=\textsc {super}(R){\setminus }R\), where \(\textsc {super}(R)\) is the block that contained R before splitting. This means that the other half will also participate in the splitting process. The gain comes from how efficiently the information computed for a block can be reused later in the computation. The core difference from the deterministic case [13] is that if M is deterministic then the use of \(R'\) is redundant, i.e., the \(\mathbf {SAT}\) check holds for R iff it holds for \(\textsc {super}(R){\setminus }R\), so the superblock mapping is not needed.
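The smaller-half strategy with the superblock bookkeeping can be sketched as follows, again over a small finite universe with enumerable denotations. This is a simplified rendition rather than the paper's pseudocode: splits are found by grouping on the pair of character sets into R and \(R'\), a multiway split of a block outside W adds only its smallest part (generalizing the two-way rule in the text), and all names are illustrative.

```python
# Sketch of greedy refinement: only the smaller half of a split block
# re-enters the workset W, and each splitter R is paired with
# R' = super(R) \ R. super_of records the superblock bookkeeping from
# the proof of Theorem 4. Finite universe; names are illustrative.
def greedy_bisim(states, final, delta):
    P = [frozenset(b) for b in (final, states - final) if b]
    W = [min(P, key=len)]                    # start from the smaller block
    super_of = {b: frozenset(states) for b in P}
    while W:
        R = W.pop()
        Rp = super_of[R] - R                 # R' = super(R) \ R
        for B in list(P):
            groups = {}
            for p in B:
                key = (frozenset(a for q in R for a in delta(p, q)),
                       frozenset(a for q in Rp for a in delta(p, q)))
                groups.setdefault(key, set()).add(p)
            if len(groups) > 1:
                P.remove(B)
                subs = [frozenset(g) for g in groups.values()]
                P.extend(subs)
                if B in W:                   # B not yet used as splitter:
                    W.remove(B)              # keep its old superblock
                    W.extend(subs)
                    for s in subs:
                        super_of[s] = super_of[B]
                else:                        # B already covered: it becomes
                    W.append(min(subs, key=len))   # the superblock itself
                    for s in subs:
                        super_of[s] = B
    return P
```

On the nondeterministic example discussed next (Example 1), the \(R'\) component of the key is exactly what separates q from r; dropping it would leave \(\{q,r\}\) unsplit.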

Fig. 2. Sample NFA.

Example 1

This example illustrates why the additional \(\mathbf {SAT}\)-checks on \(\textsc {super}(R){\setminus }R\) are needed in the greedy algorithm, when M is nondeterministic. Let M be the NFA in Fig. 2, where \(U_{\!\!\mathcal {A}}=\{a\}\). Then initially \(W = \{\{f\}\}\) and \({\mathcal {P}}= \{\{q,r\},\{f\}\}\). So, in the first iteration \(R=\{f\}\). Let \(R' = \textsc {super}(R){\setminus }R = \{q,r\}\). The only candidate block for B is \(\{q,r\}\). \(\mathbf {SAT}(\varDelta (q,R)\wedge \lnot \varDelta (r,R))\) fails because \([\![\varDelta (q,R)]\!]_{} = [\![\varDelta (r,R)]\!]_{} = \{a\}\), while \([\![\varDelta (q,R')]\!]_{} = \{a\}\) and \([\![\varDelta (r,R')]\!]_{} = \emptyset \). Thus, if \(\mathbf {SAT}(\varDelta (q,R')\wedge \lnot \varDelta (r,R'))\) were omitted then the algorithm would return \(\sim _{\{\{q,r\},\{f\}\}}\) but \(q\not \sim _M r\). \(\boxtimes \)

Theorem 4

\( GreedyBisimSFA (M)\) computes \(\sim _M\).

Proof

(outline). The justification behind splitting of B into D and \(B{\setminus } D\) based on R or \(\textsc {super}(R){\setminus }R\) is analogous to the argument provided in the proof of Theorem 3. We show that no splits are missed due to the additional optimization.

In the case a block B in W has not yet been used as a splitter, its original superblock \(B^s=\textsc {super}(B)\) must be kept as the superblock of the new subblocks D and \(B{\setminus }D\). This implies that blocks \(B^s{\setminus }D\) and \(B^s{\setminus }(B{\setminus }D)\) serve as the replacement candidate splitters for the block \(B^s{\setminus }B\). In the case a block B is not in W, its use as a splitter is already covered, and it serves as the superblock for its subblocks D and \(B{\setminus }D\), i.e., \(\textsc {super}(D) = B\) and \(\textsc {super}(B{\setminus }D) = B\), which implies that \(\textsc {super}(D){\setminus }D = B{\setminus }D\) and \(\textsc {super}(B{\setminus }D){\setminus }(B{\setminus }D) = D\). \(\boxtimes \)

Complexity. If the complexity of checking satisfiability of predicates of size \(\ell \) is \(f(\ell )\), the naive implementation of \( GreedyBisimSFA (M)\) presented in Fig. 1, which explicitly computes \(\varDelta (r,\textsc {super}(R){\setminus }R)\), has complexity \({\mathcal {O}}(mn f(n\ell ))\), with m as the number of transitions in the input s-FA and n as the number of states. Even though only the smaller block is added to W after a split, both blocks are eventually visited. Therefore, we still have a quadratic complexity as n and m are multiplied. In the next section, we discuss a different data structure that yields a different complexity for the greedy algorithm in Fig. 1.

6 Counting Symbolic Partition Refinement

We want to avoid explicit computation of \(\varDelta (p,\textsc {super}(R){\setminus }R)\) in the greedy algorithm. We investigate a method that can reuse the computation performed for \(\textsc {super}(R)\) and R in order to calculate \(\varDelta (p,\textsc {super}(R){\setminus }R)\). We consider a symbolic bag data structure that, by using predicates in \(\varPsi _{\!\!{\mathcal {A}}}\), provides a finite partition of \(U_{\!\!\mathcal {A}}\) and maps each part of the partition to a natural number. A (symbolic) bag \(\sigma \) denotes a function \([\![\sigma ]\!]_{}\) from \(U_{\!\!\mathcal {A}}\) to \({\mathbb {N}}\) that has a finite range. All elements that map to the same number effectively define a part or block of the partition. For \(p\in Q\) and \(S\subseteq Q\) let \(\textit{Bag}(p,S)\) be a bag such that, for all \(a\in U_{\!\!\mathcal {A}}\),

$$\begin{aligned}{}[\![\textit{Bag}(p,S)]\!]_{}(a) = |\{q\in S\mid p\xrightarrow {a}q\}|. \end{aligned}$$

In other words, in addition to encoding whether a character a can reach S from p, the bag also encodes how many different target states in S it can reach. Let \(\textit{Set}\) be a function that transforms bags \(\sigma \) to predicates in \(\varPsi _{\!\!{\mathcal {A}}}\) such that

$$\begin{aligned}{}[\![\textit{Set}(\sigma )]\!]_{} = \{a\in U_{\!\!\mathcal {A}}\mid [\![\sigma ]\!]_{}(a) > 0\} \end{aligned}$$

In particular \([\![\textit{Set}(\textit{Bag}(p,S))]\!]_{} = [\![\varDelta (p,S)]\!]_{}\). A bag can be implemented effectively in several ways and we defer the discussion of such choices to below. We assume that there is an effective difference operation \(\mathbin {\dot{-}}\) over bags such that, for all \(a\in U_{\!\!\mathcal {A}}\), given \([\![\sigma ]\!]_{}(a)\ge [\![\tau ]\!]_{}(a)\), \([\![\sigma \mathbin {\dot{-}}\tau ]\!]_{}(a) = [\![\sigma ]\!]_{}(a)-[\![\tau ]\!]_{}(a)\). So

$$\begin{aligned} \textit{Bag}(p,\textsc {super}(R){\setminus }R) = \textit{Bag}(p,\textsc {super}(R))\mathbin {\dot{-}}\textit{Bag}(p,R). \end{aligned}$$

This shows that each \(\varDelta (p,X)\) in the greedy algorithm can be represented using a symbolic bag. The potential advantage, provided that we can efficiently implement the difference and the \(\textit{Set}\) operations, is that in the computation of \(\textit{Bag}(p,\textsc {super}(R){\setminus }R)\) we can reuse the prior computations of \(\textit{Bag}(p,\textsc {super}(R))\) and \(\textit{Bag}(p,R)\), and therefore do not need \(\textsc {super}(R){\setminus }R\).

We call the instance of the greedy algorithm that uses symbolic bags the counting algorithm or \( CountingBisimSFA \). The counting algorithm is a generalization of the bisimulation based minimization algorithm of NFAs [2] from using algebraic decision diagrams (ADDs) [4] and binary decision diagrams (BDDs) [9] for representing multisets and sets of characters, to symbolic bags and predicates. If the size of the alphabet is \(k = 2^p\) then p is the depth or the number of bits required in the ADDs. An open problem for symbolic bags is to maintain an equally efficient data structure. Although theoretically p is bounded by the number of predicates in the s-NFA, the actual computation of those bits and their relationship to the predicates of the s-NFA requires that the s-NFA is first transformed into an NFA. However, the NFA transformation has complexity \(O(2^p)\). This factor is also reflected in the complexity of the algorithm in [2], which is \(O(km\log n)\) with k, m and n as above.

Implementation. We define symbolic bags over \({\mathcal {A}}\), denoted \( Bag _{\!\!{\mathcal {A}}}\), as the least set of expressions that satisfies the following conditions.

  • If \(n\in {\mathbb {N}}\) then \({{\varvec{nat}}(n)}\in Bag _{\!\!{\mathcal {A}}}\).

  • If \(\varphi \in \varPsi _{\!\!{\mathcal {A}}}\) and \(\sigma ,\tau \in Bag _{\!\!{\mathcal {A}}}\) then \({{\varvec{ite}}(\varphi ,\sigma ,\tau )}\in Bag _{\!\!{\mathcal {A}}}\).

The denotation of a bag \(\sigma \) is a function \([\![\sigma ]\!]_{}:U_{\!\!\mathcal {A}}\rightarrow {\mathbb {N}}\) such that, for all \(a\in U_{\!\!\mathcal {A}}\),

$$\begin{aligned}{}[\![{{\varvec{nat}}(n)}]\!]_{}(a) = n, \qquad [\![{{\varvec{ite}}(\varphi ,\sigma ,\tau )}]\!]_{}(a) = \left\{ \begin{array}{ll} [\![\sigma ]\!]_{}(a), &{} \text {if }a\in [\![\varphi ]\!]_{};\\ [\![\tau ]\!]_{}(a), &{} \text {otherwise.} \end{array} \right. \end{aligned}$$

We say that a symbolic bag is clean if every path from the root to a leaf is satisfiable. In our operations over bags we maintain cleanness. An operator \(\diamond \), such as \(+\) or \(\mathbin {\dot{-}}\), over \({\mathbb {N}}\) is lifted to bags as follows.

$$\begin{aligned} \sigma \diamond \tau&\mathop {=}\limits ^{\mathrm {\tiny def}}\sigma \diamond _{\top } \tau \\ {{\varvec{nat}}(m)} \diamond _{\gamma } {{\varvec{nat}}(n)}&\mathop {=}\limits ^{\mathrm {\tiny def}}{{\varvec{nat}}(m\diamond n)} \\ {{\varvec{ite}}(\varphi ,\sigma ,\tau )} \diamond _{\gamma } \rho&\mathop {=}\limits ^{\mathrm {\tiny def}}{{\varvec{ite}}(\varphi ,\sigma \diamond _{\gamma \wedge \varphi }\rho ,\tau \diamond _{\gamma \wedge \lnot \varphi }\rho )} \\ {{\varvec{nat}}(n)} \diamond _{\gamma } {{\varvec{ite}}(\varphi ,\sigma ,\tau )}&\mathop {=}\limits ^{\mathrm {\tiny def}}\left\{ \begin{array}{l} {{\varvec{nat}}(n)} \diamond _{\gamma } \tau ,\; \text {if not }\,\, \mathbf {SAT}(\gamma \wedge \varphi );\\ {{\varvec{nat}}(n)} \diamond _{\gamma } \sigma ,\; \text {else if not }\,\, \mathbf {SAT}(\gamma \wedge \lnot \varphi );\\ {{\varvec{ite}}(\varphi ,{{\varvec{nat}}(n)} \diamond _{\gamma \wedge \varphi }\sigma ,{{\varvec{nat}}(n)} \diamond _{\gamma \wedge \lnot \varphi } \tau )},\; \text {otherwise.} \end{array} \right. \end{aligned}$$

Cleaning of the result is done incrementally during construction by passing the context condition \(\gamma \) with the operator \(\diamond _{\gamma }\). Observe that if \(\alpha \wedge \beta \) is unsatisfiable (i.e., \([\![\alpha ]\!]_{}\cap [\![\beta ]\!]_{}=\emptyset \)) then \(\alpha \) implies \(\lnot \beta \) (i.e., \([\![\alpha ]\!]_{}\subseteq [\![\lnot \beta ]\!]_{}\)). For all \(p,q\in Q\) let

$$\begin{aligned} \textit{Bag}(p,q) \mathop {=}\limits ^{\mathrm {\tiny def}}\left\{ \begin{array}{ll} {{\varvec{ite}}(\varDelta (p,q),{{\varvec{nat}}(1)},{{\varvec{nat}}(0)})}, &{}\text {if}\,\,\varDelta (p,q)\ne \bot ;\\ {{\varvec{nat}}(0)}, &{}\text {otherwise.} \end{array} \right. \end{aligned}$$

Let \(\textit{Bag}(p,R)\mathop {=}\limits ^{\mathrm {\tiny def}}\sum _{q\in R}\textit{Bag}(p,q)\). One additional simplification is performed: if \([\![\sigma ]\!]_{}=[\![\tau ]\!]_{}\), the expression \({{\varvec{ite}}(\varphi ,\sigma ,\tau )}\) is simplified to \(\sigma \). The \(\textit{Set}(\sigma )\) operation replaces each non-zero leaf of \(\sigma \) with \(\top \) and each zero leaf with \(\bot \), assuming, w.l.o.g., that \({\mathcal {A}}\) has the corresponding operator \({{\varvec{ite}}(\varphi ,\psi ,\gamma )}\) with the expected semantics \([\![{{\varvec{ite}}(\varphi ,\psi ,\gamma )}]\!]_{} = [\![(\varphi \wedge \psi )\vee (\lnot \varphi \wedge \gamma )]\!]_{}\).

Example 2

Consider an s-NFA M with alphabet \({\mathcal {A}}\) such that \(U_{\!\!\mathcal {A}}= {\mathbb {N}}\) that has the following transitions from a given state p: \(\{p\xrightarrow {\phi _{2}}q_2, p\xrightarrow {\phi _{3}}q_3,p\xrightarrow {\phi _{6}}q_6\}\) where \(\phi _{k}\) for \(k\ge 1\) is a predicate such that \(n\in [\![\phi _{k}]\!]_{}\) iff n is divisible by k. In the following \({{\varvec{ite}}(\varphi ,l,r)}\) is depicted with \(\varphi \) as the node, l as the left subtree, and r as the right subtree. Let \(R=\{q_2,q_3,q_6\}\). Then \(\textit{Bag}(p,R) = \textit{Bag}(p,q_2)+\textit{Bag}(p,q_3) + \textit{Bag}(p,q_6)\) is computed as follows:

In the second addition, all the branch conditions of the leaves of the first tree, other than the first branch, become unsatisfiable under the condition \(\phi _{6}\). Only the very first branch condition \(\phi _{2}\,\wedge \,\phi _{3}\) is consistent with (in this case equivalent to) \(\phi _{6}\), while \({{\varvec{nat}}(0)}\) is the identity for \(+\). Hence the corresponding leaf becomes \({{\varvec{nat}}(3)} = {{\varvec{nat}}(2)} + {{\varvec{nat}}(1)}\). \(\boxtimes \)
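The bag operations of this section can be made concrete in a short executable sketch. The following Python sketch (not the paper's C# implementation) models predicates of a decidable Boolean algebra as finite sets over a toy universe, so that \(\mathbf {SAT}\) is simply non-emptiness; the names Nat, Ite, lift, and bag_edge are illustrative.

```python
from dataclasses import dataclass
from typing import Union

UNIVERSE = frozenset(range(36))            # toy universe U_A

def SAT(phi):                              # satisfiability = non-emptiness
    return len(phi) > 0

def AND(phi, psi):
    return phi & psi

def NOT(phi):
    return UNIVERSE - phi

@dataclass(frozen=True)
class Nat:                                 # leaf nat(n)
    n: int

@dataclass(frozen=True)
class Ite:                                 # node ite(phi, left, right)
    phi: frozenset
    left: 'Bag'
    right: 'Bag'

Bag = Union[Nat, Ite]

def denote(sigma, a):
    """[[sigma]](a): follow the branch whose condition holds for a."""
    while isinstance(sigma, Ite):
        sigma = sigma.left if a in sigma.phi else sigma.right
    return sigma.n

def lift(op, sigma, tau, gamma=UNIVERSE):
    """sigma <> tau, cleaning incrementally under the context condition gamma."""
    if isinstance(sigma, Nat) and isinstance(tau, Nat):
        return Nat(op(sigma.n, tau.n))
    if isinstance(sigma, Ite):
        return Ite(sigma.phi,
                   lift(op, sigma.left, tau, AND(gamma, sigma.phi)),
                   lift(op, sigma.right, tau, AND(gamma, NOT(sigma.phi))))
    # sigma is a leaf and tau an ite node: prune unsatisfiable branches
    if not SAT(AND(gamma, tau.phi)):
        return lift(op, sigma, tau.right, gamma)
    if not SAT(AND(gamma, NOT(tau.phi))):
        return lift(op, sigma, tau.left, gamma)
    return Ite(tau.phi,
               lift(op, sigma, tau.left, AND(gamma, tau.phi)),
               lift(op, sigma, tau.right, AND(gamma, NOT(tau.phi))))

def bag_edge(phi):
    """Bag(p, q) for a transition with guard phi: ite(phi, nat(1), nat(0))."""
    return Ite(phi, Nat(1), Nat(0))
```

Replaying Example 2 with divisibility predicates \(\phi _{2},\phi _{3},\phi _{6}\), the bag \(\textit{Bag}(p,\{q_2,q_3,q_6\})\) built with lift denotes 3 on multiples of 6, 1 on numbers divisible by exactly one of 2 and 3, and 0 elsewhere.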

Complexity. In this implementation, \(\varDelta (r,B)\) is represented by \(\textit{Set}(\textit{Bag}(r,B))\), and \(\varDelta (r,\textsc {super}(R){\setminus }R)\) can be computed from \(\textit{Bag}(r,\textsc {super}(R))\) and \(\textit{Bag}(r,R)\) without iterating over the automaton transitions. However, in the worst case, at each step of the algorithm the Bag data structure can have size exponential in p, the number of distinct predicates in the s-FA. Using an amortized complexity argument similar to the one used for Hopcroft's DFA minimization algorithm [20], we have that, if we ignore the cost of maintaining the bag data structure, the algorithm has complexity \({\mathcal {O}}(m\log n)\). In summary, if the complexity of checking satisfiability of predicates of size \(\ell \) is \(f(\ell )\), the counting implementation of \( GreedyBisimSFA (M)\) presented in Fig. 1 has complexity \({\mathcal {O}}(2^p m\log n\, f(n\ell ))\), where m is the number of transitions in the input s-FA, n is the number of states, and p is the number of distinct predicates in the automaton. Concretely, while this implementation reduces the number of iterations over the automaton transitions, it incurs an extra cost that is a function of the alphabet complexity and of the predicates appearing in the automaton. Notice that in the case of finite alphabets \(2^p\) is exactly the size of the alphabet and this problem does not arise [2]. This is another notable example of how classic algorithms do not always carry over to the symbolic setting.
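To illustrate how \(\varDelta (r,\textsc {super}(R){\setminus }R)\) can be recovered from \(\textit{Bag}(r,\textsc {super}(R))\) and \(\textit{Bag}(r,R)\) without touching the transitions, here is a self-contained Python sketch using truncated subtraction (monus) and a flattened \(\textit{Set}\) operation. Predicates are modeled as finite sets over a toy universe; monus, set_of, and the other names are illustrative, not part of the paper's implementation.

```python
from dataclasses import dataclass
from typing import Union

UNIVERSE = frozenset(range(36))            # toy universe U_A

@dataclass(frozen=True)
class Nat:                                 # leaf nat(n)
    n: int

@dataclass(frozen=True)
class Ite:                                 # node ite(phi, left, right)
    phi: frozenset
    left: 'Bag'
    right: 'Bag'

Bag = Union[Nat, Ite]

def lift(op, sigma, tau, gamma=UNIVERSE):
    """Context-passing lift of a binary operator over N (as in Sect. 6)."""
    if isinstance(sigma, Nat) and isinstance(tau, Nat):
        return Nat(op(sigma.n, tau.n))
    if isinstance(sigma, Ite):
        return Ite(sigma.phi,
                   lift(op, sigma.left, tau, gamma & sigma.phi),
                   lift(op, sigma.right, tau, gamma - sigma.phi))
    # sigma is a leaf, tau an ite node: prune unsatisfiable branches
    if not (gamma & tau.phi):
        return lift(op, sigma, tau.right, gamma)
    if not (gamma - tau.phi):
        return lift(op, sigma, tau.left, gamma)
    return Ite(tau.phi,
               lift(op, sigma, tau.left, gamma & tau.phi),
               lift(op, sigma, tau.right, gamma - tau.phi))

def monus(m, n):
    """Truncated subtraction over N."""
    return max(m - n, 0)

def set_of(sigma, gamma=UNIVERSE):
    """Set(sigma): the predicate {a | [[sigma]](a) > 0}, flattened to a set."""
    if isinstance(sigma, Nat):
        return gamma if sigma.n > 0 else frozenset()
    return (set_of(sigma.left, gamma & sigma.phi)
            | set_of(sigma.right, gamma - sigma.phi))
```

With \(\textsc {super}(R)=\{q_2,q_3\}\) and \(R=\{q_2\}\), computing \(\textit{Set}(\textit{Bag}(r,\textsc {super}(R)) \mathbin {\dot{-}} \textit{Bag}(r,R))\) yields the guard toward \(q_3\) alone.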

7 Evaluation

We evaluate our algorithms on two sets of benchmarks. We report the state reduction obtained using forward bisimulations and, for each algorithm, we compare the running times and the number of explored blocks. We use Simple to denote the algorithm presented at the top of Fig. 1, Greedy to denote the algorithm presented in Sect. 5, and Count to denote the counting-based algorithm described in Sect. 6. As a sanity check, we verified that all three algorithms compute the same results. All experiments were run on a 4-core Intel i7-2600 CPU at 3.40 GHz with 8 GB of RAM.

Fig. 3.

State reduction for the two benchmark sets.

Regexlib. We collected the s-NFAs over the alphabet \({\mathbf {2}}^{\textsc {bv16}}\) resulting from converting 2,625 regular expressions appearing in http://regexlib.com/. This website contains a library of crowd-sourced regular expressions for tasks such as detecting URLs, emails, and phone numbers. These s-NFAs have 1 to 3,174 states and 1 to 10,670 transitions, with an average of 2 transitions per state. These benchmarks operate over very large alphabets and can only be handled symbolically. We use the algebra \({\mathbf {2}}^{\textsc {bv}k}\).

Verification s-NFAs. We collected 1,000 s-NFAs over small alphabets (2-40 symbols) appearing in verification applications from [8]. These s-NFAs are generated from the steps of abstract regular model checking while verifying the bakery algorithm, a producer-consumer system, bubble sort, an algorithm that reverses a circular list, and a Petri net model of the readers/writers protocol. They have 4 to 3,782 states and 7 to 18,670 transitions, with an average of 4.1 transitions per state. Given the small size of the alphabets, these automata are quite dense. We represent the alphabet using the algebra \(\mathbf {int}[\mathrm {k}]\).

State Reduction. Figure 3 shows the state reduction obtained by our algorithm. Each point (x, y) in the figure indicates that an automaton with x states was reduced to an equivalent automaton with y states. On average, the number of states is reduced by 14% for the regexlib benchmarks and by 19% for the verification NFAs.

Fig. 4.

Running times of the three algorithms on regular expressions from www.regexlib.com and on NFAs from verification applications. In the second plot, we omit data points that are very close to each other to keep the figure readable.

Runtime. Figure 4 shows the running times of the algorithms on each benchmark s-FA. For the regexlib s-FAs, most automata take less than 1 ms to process, resulting in identical running times for the three algorithms on 2,528 benchmarks. In general, the Greedy algorithm is slightly faster than the other two, and the Count algorithm is at times slower than both (93 cases in total), all on relatively small instances. On two large instances (1,502 and 3,174 states; 1,502 and 10,670 transitions) the Greedy and Count algorithms clearly outperform the Simple algorithm.

For the s-FAs from [8], the Simple and Greedy algorithms have very comparable performance (Greedy is, on average, 6 ms slower than Simple). The Count algorithm is slower than both in 90% of the cases and has the same performance in the remaining 10%.

In both experiments, almost all of the Count algorithm's running time is spent manipulating the counting data structure presented in Sect. 6. In summary, despite its \({\mathcal {O}}(m\log n)\) bound on the number of block operations, the Count algorithm is consistently slower than the other two algorithms; the slowdown is due to the cost of manipulating the counting data structure.

Fig. 5.

Ratio of number of explored blocks between the simple algorithm and the other algorithms.

Explored Blocks. We measure the number of blocks pushed into the worklist W by the different algorithms. Figure 5 shows the ratio between the number of blocks explored by the Simple algorithm and by each of the other two algorithms. As expected from the theoretical complexities, the Count algorithm consistently explores fewer blocks than the Simple algorithm. As observed in Fig. 4, however, this is not enough to achieve better running times. The Greedy algorithm often explores more blocks than the other two algorithms. This is because the set \(R' = \textsc {super}(R){\setminus }R\) is explored even when \(R'\) has already been split into subsets. In this case, the Simple algorithm explores only the splits and not the original set, while the Greedy algorithm explores both \(R'\) and its splits.

8 Related Work

Minimization of Deterministic Automata. Automata minimization algorithms have been studied and analyzed extensively from several different angles. Moore's and Hopcroft's algorithms [20, 25] are the two most common algorithms for minimizing DFAs. Both compute forward bisimulations over DFAs and can be implemented with complexity \(O(kn\log n)\), where k is the size of the alphabet; this bound is tight [5,6,7]. The two algorithms, although in different ways, iteratively refine a partition of the set of states until the forward bisimulation is computed. In the case of DFAs, the equivalence relation induced by the bisimulation produces a minimal and canonical DFA. In our earlier work, we extended Hopcroft's algorithm to symbolic alphabets [13] and showed that, for deterministic s-FAs, the algorithm can be implemented in \({\mathcal {O}}(m\log n\, f(n\ell ))\) for automata with m transitions, n states, and predicates of size \(\ell \), where f(x) is the cost of checking satisfiability of predicates of size x. The algorithm proposed in [13] is similar to the Greedy algorithm in Fig. 1. The main difference is the need to use \(\textsc {super}(R){\setminus }R\) in the \(\mathbf {SAT}\) checks; this seemingly small change has drastic complexity implications.

Minimization and State Reduction in Nondeterministic Automata. In the case of NFAs, there exists no canonical minimal automaton, and the problem of finding a minimal NFA is known to be PSPACE-complete [24]. It is shown in [18] that NFA minimization cannot even be efficiently approximated. The original search-based algorithm for minimizing NFAs is known as the Kameda-Weiner method [22]; a generalization based on atoms of regular languages [10] was recently introduced in [28]. Most practical approaches for computing small nondeterministic automata use notions of state reduction that do not always produce a minimal NFA [2]. These techniques are based on computing various kinds of simulation and bisimulation relations; the most common such relations are described in detail, and extended to Büchi automata, in [23]. In this paper, we are only concerned with performing state reduction by computing forward bisimulations.

Abdulla et al. were the first to observe that forward bisimulations for NFAs can be computed with complexity \({\mathcal {O}}(k m\log n)\) by counting, for each state and symbol, the number of transitions into a given block of the partition [2]. In the same paper, they proposed an efficient implementation based on BDDs and algebraic decision diagrams for the special case in which the alphabet is a set of bit-vectors. The techniques proposed in [2] are tailored to finite alphabets, and the goal of this paper is to extend them to arbitrary alphabets that form a decidable Boolean algebra. We propose an extension based on our symbolic bag data structure and show experimentally that, unlike in the case of finite alphabets, the counting algorithm is not practical.

Recently, Geldenhuys et al. proposed a technique for reducing the size of certain classes of NFAs using SAT solvers [17]. In this technique, a SAT formula describes the existence of an NFA that is equivalent to the original one but has at most k states. Applying these techniques to symbolic automata is an interesting research direction.

Automata with Predicates. The concept of automata with predicates instead of concrete symbols was first mentioned in [31] and first discussed in [29] in the context of natural language processing. Since then, s-FAs have been studied extensively: there are algorithms for minimizing deterministic s-FAs [13] and deterministic s-FAs over trees [14], as well as extensions of classic logic results to s-FAs [15]. To the best of our knowledge, the problem of reducing the state space and efficiently computing forward bisimulations for nondeterministic s-FAs has not been studied before. The term symbolic automata is sometimes used to refer to automata over finite alphabets whose state space is represented using BDDs [27]; this meaning differs from the one used in this paper.

AutomataDotNet. The open-source Microsoft Automata project [1] is an extension of the automata toolkit originally introduced in [30]. The C# source code of all the algorithms discussed in this paper, as well as of the experiments discussed in Sect. 7, is available in [1].