On the State Complexity of the Shuffle of Regular Languages

Brzozowski, Janusz; Jirásková, Galina; Liu, Bo; Rajasekaran, Aayush; Szykuła, Marek

doi:10.1007/978-3-319-41114-9_6


Conference paper
First Online: 28 June 2016


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9777)

Abstract

We investigate the shuffle operation on regular languages represented by complete deterministic finite automata. We prove that $f(m,n)=2^{mn-1} + 2^{(m-1)(n-1)}(2^{m-1}-1)(2^{n-1}-1)$ is an upper bound on the state complexity of the shuffle of two regular languages having state complexities m and n, respectively. We also state partial results about the tightness of this bound. We show that there exist witness languages meeting the bound if $2\leqslant m\leqslant 5$ and $n\geqslant 2$, and also if $m=n=6$. Moreover, we prove that in the subset automaton of the NFA accepting the shuffle, all $2^{mn}$ states can be distinguishable, and an alphabet of size three suffices for that. It follows that the bound can be met if all f(m, n) states are reachable. We know that an alphabet of size at least mn is required provided that $m,n \geqslant 2$. The question of reachability, and hence also of the tightness of the bound f(m, n) in general, remains open.

This work was supported by the Natural Sciences and Engineering Research Council of Canada under grant No. OGP0000871, by VEGA grant 2/0084/15, and by the National Science Centre, Poland under project number 2014/15/B/ST6/00615.

B. Liu—Present address: Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA.


1 An Upper Bound for the Shuffle Operation

The state complexity of a regular language L [6] is the number of states in a complete minimal deterministic finite automaton (DFA) recognizing the language; it will be denoted by $\kappa (L)$. The state complexity of an operation on regular languages is the maximal state complexity of the result of the operation expressed as a function of the state complexities of the operands.

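Let $\varSigma $ be a finite non-empty alphabet. The shuffle $u \mathbin{\sqcup\!\sqcup} v$ of words $u,v\in \varSigma ^*$ is defined as follows:

$$ u \mathbin{\sqcup\!\sqcup} v = \{u_1v_1\cdots u_kv_k \mid u=u_1\cdots u_k,\ v=v_1\cdots v_k,\ u_1,\ldots ,u_k,v_1,\ldots ,v_k\in \varSigma ^*\}. $$

The shuffle of two languages K and L over $\varSigma $ is defined by

$$ K \mathbin{\sqcup\!\sqcup} L = \bigcup _{u\in K,\, v\in L} u \mathbin{\sqcup\!\sqcup} v. $$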

Note that the shuffle operation is commutative on both words and languages.
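As an illustration, the shuffle of two words is the set of all interleavings that preserve the order of letters within each word. The following is a minimal sketch for finite inputs only; the function names `shuffle_words` and `shuffle_languages` are ours and do not come from the paper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def shuffle_words(u: str, v: str) -> frozenset:
    """All interleavings of u and v that keep the letter order of each word."""
    if not u:
        return frozenset({v})
    if not v:
        return frozenset({u})
    # The first letter of the result comes either from u or from v.
    first_from_u = {u[0] + w for w in shuffle_words(u[1:], v)}
    first_from_v = {v[0] + w for w in shuffle_words(u, v[1:])}
    return frozenset(first_from_u | first_from_v)


def shuffle_languages(K, L):
    """Shuffle of two finite languages given as iterables of words."""
    result = set()
    for u in K:
        for v in L:
            result |= shuffle_words(u, v)
    return result
```

For instance, `shuffle_words("ab", "c")` yields the three words abc, acb, and cab, and swapping the arguments gives the same set.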

The state complexity of the shuffle operation was first studied by Câmpeanu et al. [2], but they considered only bounds for incomplete deterministic automata. In particular, they proved that $2^{mn}-1$ is a tight upper bound for that case. Since we can convert an incomplete deterministic automaton into a complete one by adding the empty state, it follows that $2^{(m-1)(n-1)}$ is a lower bound for the case of complete deterministic automata. Here we show that this lower bound can be improved, and we derive an upper bound for two regular languages represented by complete deterministic automata; the question whether this bound is tight remains open.

A nondeterministic finite automaton (NFA) is a quintuple ${\mathcal A}=(Q,\varSigma ,\delta ,s,F)$, where Q is a finite non-empty set of states, $\varSigma $ is a finite alphabet of input symbols, $\delta :Q\times \varSigma \rightarrow 2^Q$ is the transition function which is extended to the domain $2^Q\times \varSigma ^*$ in the natural way, $s\in Q$ is the initial state, and $F\subseteq Q$ is the set of final states. The language accepted by NFA ${\mathcal A}$ is the set of words $L({\mathcal A})=\{w\in \varSigma ^*\mid \delta (s,w)\cap F \ne \emptyset \}$.

An NFA ${\mathcal A}$ is deterministic and complete (DFA) if $|\delta (q,a)|=1$ for each q in Q and each a in $\varSigma $. In such a case, we write $\delta (q,a)=q'$ instead of $\delta (q,a)=\{q'\}$. A DFA is minimal (with respect to the number of states) if all its states are reachable, and no two distinct states are equivalent.

Every NFA ${\mathcal A}=(Q,\varSigma ,\delta ,s,F)$ can be converted to an equivalent DFA ${\mathcal A}'=(2^Q,\varSigma ,\delta , \{s\},F')$, where $F'=\{R\in 2^Q \mid R\cap F\ne \emptyset \}$. The DFA ${\mathcal A}'$ is called the subset automaton of NFA ${\mathcal A}$. The subset automaton may not be minimal since some of its states may be unreachable or equivalent to other states.

Let K and L be regular languages over an alphabet $\varSigma $ recognized by deterministic finite automata $ {\mathcal K}=(Q_K,\varSigma ,\delta _K,q_K,F_K)$ and $ {\mathcal L}=(Q_L,\varSigma ,\delta _L,q_L,F_L)$, respectively. Then $K \mathbin{\sqcup\!\sqcup} L$ is accepted by the nondeterministic finite automaton

$${\mathcal N}=(Q_K \times Q_L, \varSigma , \delta , (q_K,q_L), F_K\times F_L),$$

where

$$ \delta ((p,q),a)=\{(\delta _K(p,a),q), (p,\delta _L(q,a))\}. $$

Let ${\mathcal D}=(2^{Q_K\times Q_L},\varSigma ,\delta ',\{(q_K,q_L)\},F')$ be the subset automaton of ${\mathcal N}$. If $|Q_K|=m$ and $|Q_L|=n$, then NFA ${\mathcal N}$ has mn states. It follows that DFA ${\mathcal D}$ has at most $2^{mn}$ reachable and pairwise distinguishable states. However, this upper bound cannot be met, as we will show.
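The construction can be explored with a short program. The sketch below assumes that each letter is given as a pair of dictionaries (one transition map per DFA); the names `nfa_step` and `reachable_subsets` are illustrative and not taken from the paper. Final states are ignored because only reachability in the subset automaton is computed.

```python
def nfa_step(S, dK, dL):
    """Image of the subset S under one letter.

    dK and dL map each state of the first (second) DFA to its successor
    under that letter.
    """
    T = set()
    for (p, q) in S:
        T.add((dK[p], q))  # the letter is read by the first DFA
        T.add((p, dL[q]))  # the letter is read by the second DFA
    return frozenset(T)


def reachable_subsets(letters, start=(1, 1)):
    """Subsets reachable from {start}; letters maps a symbol to (dK, dL)."""
    init = frozenset({start})
    seen, stack = {init}, [init]
    while stack:
        S = stack.pop()
        for dK, dL in letters.values():
            T = nfa_step(S, dK, dL)
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return seen


# Toy instance: m = n = 2 and a single letter that swaps both states
# in each DFA.
swap = {1: 2, 2: 1}
example = reachable_subsets({"a": (swap, swap)})
```

In this toy instance the reachable subsets are $\{(1,1)\}$, $\{(2,1),(1,2)\}$, and $\{(1,1),(2,2)\}$.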

In the sequel, we assume that $Q_K=\{1,2,\ldots ,m\}$, $q_K=1$, $Q_L=\{1,2,\ldots ,n\}$, and $q_L=1$. We say that a state (p, q) of NFA ${\mathcal N}$ is in row i if $p=i$, and it is in column j if $q=j$.

Proposition 1

Let $a\in \varSigma $. Let S be a state of ${\mathcal D}$. Let $\pi _{{{\mathrm{col}}}}(S) = \{p \mid (p,q) \in S\text { for some }q\}$, and $\pi _{{{\mathrm{row}}}}(S) = \{q \mid (p,q) \in S\text { for some }p\}$. Then $\pi _x(S)\subseteq \pi _x(S\cdot a)$ for $x \in \{{{\mathrm{col}}},{{\mathrm{row}}}\}$.

Proof

Let $p \in \pi _{{{\mathrm{col}}}}(S)$; then we have $(p,q) \in S$ for some q. Since $\delta ((p,q),a)=\{(\delta _K(p,a),q), (p,\delta _L(q,a))\}$, we have $(p,\delta _L(q,a))\in \delta (S,a)$, so $p \in \pi _{{{\mathrm{col}}}}(\delta (S,a))$. By symmetry, the same claim holds for $\pi _{{{\mathrm{row}}}}$. $\square $

We claim that in the subset automaton ${\mathcal D}$, every reachable subset S of $Q_K\times Q_L$ must contain a state in column 1 and a state in row 1, that is, it must satisfy the following condition.

Condition (C): There exist states (s, 1) and (1, t) in S for some $s\in Q_K$ and $t\in Q_L$.

Lemma 2

Every reachable subset S of subset automaton ${\mathcal D}$ satisfies Condition (C).

Proof

The initial subset of ${\mathcal D}$ is $\{(1,1)\}$, and it satisfies Condition (C). By Proposition 1, for every $a \in \varSigma $ we get that $1 \in \pi _{{{\mathrm{col}}}}(\delta (S,a))$ and $1 \in \pi _{{{\mathrm{row}}}}(\delta (S,a))$, so $\delta (S,a)$ satisfies Condition (C). By induction, all reachable subsets satisfy Condition (C). $\square $

Theorem 3

(Shuffle: Upper Bound). Let $\kappa (K)=m$ and $\kappa (L)=n$. Then the state complexity of the shuffle of K and L is at most

$$\begin{aligned} f(m,n)=2^{mn-1} + 2^{(m-1)(n-1)}(2^{m-1}-1)(2^{n-1}-1). \end{aligned} \tag{1}$$

Proof

By Lemma 2, every reachable subset of ${\mathcal D}$ must contain a state in row 1 and a state in column 1. There are $2^{mn-1}$ subsets containing state (1, 1), and $2^{(m-1)(n-1)}(2^{m-1}-1)(2^{n-1}-1)$ subsets not containing (1, 1) but containing (s, 1) for some $s\in \{2,3,\ldots ,m\}$ and (1, t) for some $t\in \{2,3,\ldots ,n\}$. This gives f(m, n). $\square $
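The counting argument can be cross-checked by brute force for small m and n. The sketch below evaluates the formula for f(m, n) and, independently, enumerates all subsets of $Q_K\times Q_L$ that contain a state in row 1 and a state in column 1; the function names are ours.

```python
from itertools import product


def f(m, n):
    """The bound f(m, n) from the theorem."""
    return (2 ** (m * n - 1)
            + 2 ** ((m - 1) * (n - 1)) * (2 ** (m - 1) - 1) * (2 ** (n - 1) - 1))


def count_valid_subsets(m, n):
    """Count subsets with a state in row 1 and a state in column 1."""
    cells = [(p, q) for p in range(1, m + 1) for q in range(1, n + 1)]
    count = 0
    for mask in product([False, True], repeat=len(cells)):
        S = [c for c, keep in zip(cells, mask) if keep]
        has_col1 = any(q == 1 for (_, q) in S)
        has_row1 = any(p == 1 for (p, _) in S)
        if has_col1 and has_row1:
            count += 1
    return count
```

The enumeration is exponential in mn, so it is only meant for very small parameters such as $m,n\leqslant 3$.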

Let K and L be two regular languages over $\varSigma $. If $\kappa (K)=\kappa (L)=1$, then each of K, L, and $K \mathbin{\sqcup\!\sqcup} L$ is either $\emptyset $ or $\varSigma ^*$, and $\kappa (K \mathbin{\sqcup\!\sqcup} L)=1$; hence the bound $f(1,1)=1$ is tight.

Now suppose that $\kappa (K)=1$; here we have two possible choices for K, the empty language or $\varSigma ^*$. The first choice leads to $K \mathbin{\sqcup\!\sqcup} L=\emptyset $. Hence only the second choice is of interest, where the language $\varSigma ^* \mathbin{\sqcup\!\sqcup} L$ is the all-sided ideal [1] generated by L. If $\kappa (L)=2$, the upper bound $f(1,2)=2$ is met by the unary language $L=aa^*$. Hence assume that $\kappa (K)=1$ and $\kappa (L)\geqslant 3$. The next observation shows that in such a case, the tight bound is less than $f(1,n)=2^{n-1}$.

Proposition 4

(Okhotin [4]). If $\kappa (L)=n\geqslant 3$, then the state complexity of $\varSigma ^* \mathbin{\sqcup\!\sqcup} L$ is at most $2^{n-2}+1$, and this bound can be reached only if $|\varSigma | \geqslant n-2$.

Okhotin showed that the language $L=(a_1\varSigma ^*a_1 \cup \cdots \cup a_{n-2}\varSigma ^*a_{n-2}) \varSigma ^*$, where $\varSigma =\{a_1,\ldots , a_{n-2}\}$, meets this bound [4]. This takes care of the case $\kappa (K)=1$ and, by symmetry, of the case $\kappa (L)=1$.

In what follows we assume that $m\geqslant 2$ and $n\geqslant 2$. First, let us show that the upper bound f(m, n) cannot be met by regular languages defined over a fixed alphabet.

Proposition 5

Let K and L be regular languages over $\varSigma $ with $\kappa (K)=m$ and $\kappa (L)=n$, where $m,n\geqslant 2$. If $\kappa (K \mathbin{\sqcup\!\sqcup} L)=f(m,n)$, then $|\varSigma | \geqslant mn - 1$.

Proof

For $s=2,3,\ldots ,m$ and $t=2,3,\ldots ,n$ denote

$$\begin{aligned} A_s&=\{(1,1),(s,1)\},\\ B_t&=\{(1,1),(1,t)\}, \\ C_{st}&=\{(s,1),(1,t)\}. \end{aligned}$$

If all the subsets satisfying Condition (C) are reachable, then, in particular, all the subsets $A_s, B_t$, and $C_{st}$ must be reachable. Let us show that all these subsets must be reached from some subsets containing state (1, 1) by distinct symbols.

Suppose that a set $A_s$ is reached from a reachable set S with $S\ne A_s$ by a symbol a, that is, we have $A_s=\delta (S,a)$ and $S\ne A_s$. The set $A_s$ contains only states in column 1 and rows 1 or s. By Proposition 1, the set S may only contain states in column 1 and in rows 1 or s, that is, we have $S\subseteq \{(1,1), (s,1)\}$. Since $S\ne A_s$, we must have $S=\{(1,1)\}$.

By symmetry, each $B_t$ can only be reached from $\{(1,1)\}$.

Suppose that a set $C_{st}$ is reached from a reachable set S with $S\ne C_{st}$ by a symbol a. By Proposition 1, we must have $S\subseteq \{(1,1),(s,1),(1,t),(s,t)\}$. Let us show that $(1,1)\in S$. Suppose for a contradiction that $(1,1)\notin S$. Then, since S is reachable, it must contain a state in column 1 and a state in row 1, that is, we must have $\{(s,1),(1,t)\}\subseteq S$. But then $(s,t)\in S$ since $S\ne C_{st}$. However, then $\delta _K(s,a)=1$ and $\delta _L(t,a)=1$ which implies that $(1,1)\in \delta ((s,1),a)$, and so $(1,1)\in C_{st}$. This is a contradiction. Therefore $C_{st}$ is reached from a set containing (1, 1).

Thus each $A_s$ is reached from $\{(1,1)\}$ by a symbol $a_s$, each $B_t$ is reached from $\{(1,1)\}$ by a symbol $b_t$, each $C_{st}$ is reached from a set containing (1, 1) by a symbol $c_{st}$, and we must have

$$\begin{aligned} \delta _K(1,a_s)=s&\text { and } \delta _L(1,a_s)=1,\\ \delta _K(1,b_t)=1&\text { and } \delta _L(1,b_t)=t,\\ \delta _K(1,c_{st})=s&\text { and } \delta _L(1,c_{st})=t. \end{aligned}$$

It follows that all the symbols $a_s, b_t$, and $c_{st}$ must be pairwise distinct. Therefore we have $|\varSigma |\geqslant m-1 + n-1 + (m-1)(n-1)=mn-1$. $\square $
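The final count is a routine expansion, written out here for convenience:

$$\begin{aligned} (m-1)+(n-1)+(m-1)(n-1)&=(m-1)\bigl (1+(n-1)\bigr )+(n-1)\\ &=(m-1)n+n-1\\ &=mn-1. \end{aligned}$$

For instance, with $m=2$ and $n=3$ there is one symbol $a_2$, two symbols $b_2,b_3$, and two symbols $c_{22},c_{23}$, which gives $1+2+2=5=mn-1$ pairwise distinct letters.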

Unfortunately, this lower bound on the size of the alphabet is not tight, as is demonstrated by the following example:

Example 6

If t is a transformation of the set $ \{1,2,\ldots , n\}$ and $q\in \{1,2,\ldots ,n\}$, let qt be the image of q under t. Transformation t can now be denoted by $[1t, 2t,\dots , nt]$.

(1) If $m=n=2$, we have $f(2,2)=10$. Let $\varSigma =\{a,b,c,d\}$, and let the DFAs ${\mathcal K}$ and ${\mathcal L}$ be as shown in Fig. 1, and let K and L be their languages. Then $\kappa (K \mathbin{\sqcup\!\sqcup} L)=10$. We have used GAP [3] to show that the bound cannot be reached with a smaller alphabet, and that the DFAs of Fig. 1 are unique up to isomorphism.

(2) For $m=2$ and $n=3$, the minimal size of the alphabet of a witness pair is 6. We have verified this by a dedicated algorithm enumerating all pairs of non-isomorphic DFAs with 2 and 3 states. In contrast to the previous case, over a minimal alphabet there are more than 60 non-isomorphic DFAs of L – even if we do not distinguish them by sets of final states – that meet the bound with some K. One of the witness pairs is described below.

Let $\varSigma = \{a,b,c,d,e,f\}$. Let ${\mathcal K}=(\{1,2\},\varSigma ,\delta _{K},1,\{2\})$, and let $a=[1,2]$, $b=c=[2,1]$, $d=[1,1]$, $e=[2,2]$, and $f=[2,1]$. Let $ {\mathcal L}=(\{1,2,3\},\varSigma ,\delta _{L},1,\{1\})$, and let $a=[2,2,3]$, $b=[2,1,3]$, $c=[1,1,1]$, $d=e=[3,1,2]$, $f=[3,1,1]$. Then $\kappa (K \mathbin{\sqcup\!\sqcup} L)=f(2,3)=44$.
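Readers who wish to experiment with this pair can explore the subset automaton directly. The sketch below copies the twelve transformations from the example and computes the subsets reachable from $\{(1,1)\}$; the helper names `reachable` and `satisfies_C` are ours. It checks reachability only and does not test distinguishability, so it is a partial check of the example rather than a proof.

```python
# Transformations copied from the example: letter -> [1t, 2t, ...]
K = {"a": [1, 2], "b": [2, 1], "c": [2, 1],
     "d": [1, 1], "e": [2, 2], "f": [2, 1]}
L = {"a": [2, 2, 3], "b": [2, 1, 3], "c": [1, 1, 1],
     "d": [3, 1, 2], "e": [3, 1, 2], "f": [3, 1, 1]}


def reachable(K, L):
    """Subsets of the shuffle NFA reachable from {(1, 1)}."""
    init = frozenset({(1, 1)})
    seen, stack = {init}, [init]
    while stack:
        S = stack.pop()
        for x in K:
            T = frozenset({(K[x][p - 1], q) for (p, q) in S}
                          | {(p, L[x][q - 1]) for (p, q) in S})
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return seen


def satisfies_C(S):
    """Condition (C): a state in column 1 and a state in row 1."""
    return any(q == 1 for (_, q) in S) and any(p == 1 for (p, _) in S)


subsets = reachable(K, L)
print(len(subsets))  # to be compared with the upper bound of 44 for m=2, n=3
```

By Lemma 2 every subset found this way satisfies Condition (C), so the printed number can never exceed 44.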

The bound $mn-1$ on the size of the alphabet is not tight for $m=n=2$, where an alphabet of size four is required. For any $m,n \geqslant 2$ the subsets of $\{1,2\} \times \{1,2\}$ satisfying (C) must also be reachable, and to reach them we can use only transformations mapping 1 to either 1 or 2. There are only three such transformations counted in Proposition 5; thus we need one more letter.

2 Partial Results About Tightness

To prove that the upper bound f(m, n) of Eq. (1) is tight, we must exhibit two languages K and L with state complexities m and n, respectively, such that . As usual, we use DFAs to represent the languages: Let ${\mathcal K}$ and ${\mathcal L}$ be minimal complete DFAs for K and L. We first construct the NFA ${\mathcal N}$ as defined in Sect. 1, and we consider the subset automaton ${\mathcal D}$ of NFA ${\mathcal N}$. We must then show that ${\mathcal D}$ has f(m, n) states reachable from the initial state $\{(1,1)\}$, and that these states are pairwise distinguishable. We were unable to prove this for all m and n, but we have some partial results about reachability in Subsect. 2.1, and we deal with distinguishability in Subsect. 2.2.

2.1 Reachability

We performed computations verifying reachability of the upper bound for small values of m and n. These results are summarized in Table 1.

The computation in the hardest case with $m=n=6$ took about 48 days on a computer with AMD Opteron(tm) Processor 6380 (2500 MHz) and 64 GB of RAM. Moreover, we verified that in all these cases, every subset of size at least 3 is directly reachable from some smaller subset. We also verified that for reachability in case of $m=n=3$ an alphabet of size 12 is sufficient, and in case of $m=n=4$ an alphabet of size 50 is sufficient. Using these results, we are going to prove reachability for all m, n with $2\leqslant m \leqslant 5$ and $n\geqslant 2$.

Table 1. Computational verification of reachability of the bound. The fields with $\checkmark ^*$ follow from the proofs of Subsect. 2.1.


Without loss of generality, the set of states of any n-state DFA is denoted by $Q_n=\{1,2,\dots ,n\}$. Let ${\mathcal T}_n$ be the monoid of all transformations of the set $Q_n$. Let $p,q\in Q_n$ and $P\subseteq Q_n$. Let $\mathbf {1}$ denote the identity transformation. Let $(p\rightarrow q)$ denote the transformation that maps state p to state q and acts as the identity on all the other states. Let (p, q) denote the transformation that transposes p and q.

Here we deal only with reachability, so final states do not matter. We assume that the sets of final states are empty in this subsection.

Let $\varSigma _{m,n}=\{ a_{s,t} \mid s\in {\mathcal T}_m \text { and } t\in {\mathcal T}_n\}$ be an alphabet consisting of $m^m n^n$ symbols. If an input a induces transformations s in ${\mathcal T}_m$ and t in ${\mathcal T}_n$, this will be indicated by $a:s;t$.

Define DFAs ${\mathcal K}_{m,n}=(Q_m,\varSigma _{m,n},\delta _m,1, \emptyset )$ and ${\mathcal L}_{m,n} = (Q_n, \varSigma _{m,n}, \delta _n, 1, \emptyset )$, where $\delta _m(p,a_{s,t}) = p s$ if $p\in Q_m$ and $\delta _n(q,a_{s,t}) =q t$ if $q \in Q_n$. Let ${\mathcal N}_{m,n}$ be the NFA for the shuffle of languages recognized by DFAs ${\mathcal K}_{m,n}$ and ${\mathcal L}_{m,n}$ as described in Sect. 1, and let ${\mathcal D}_{m,n}$ be the subset automaton of ${\mathcal N}_{m,n}$. The NFA ${\mathcal N}_{m,n}$ has alphabet $\varSigma _{m,n}$, and so has an input letter for every pair of transformations in ${\mathcal T}_m\times {\mathcal T}_n$. Therefore the addition of another input letter to the DFAs ${\mathcal K}_{m,n}$ and ${\mathcal L}_{m,n}$ cannot add any new set of states of ${\mathcal N}_{m,n}$ that would be reachable from $\{(1,1)\}$ in ${\mathcal D}_{m,n}$.
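For very small m and n the automaton ${\mathcal D}_{m,n}$ can be built explicitly. The sketch below represents a transformation as a tuple of images, generates all $m^m n^n$ letters, and explores the subsets reachable from $\{(1,1)\}$; the function names are ours, and the approach is only practical for tiny parameters because the alphabet grows very quickly.

```python
from itertools import product


def transformations(n):
    """All n**n maps of {1, ..., n} to itself, as tuples (1t, 2t, ..., nt)."""
    return list(product(range(1, n + 1), repeat=n))


def universal_alphabet(m, n):
    """One letter (s, t) for every pair of transformations."""
    return [(s, t) for s in transformations(m) for t in transformations(n)]


def reachable_in_D(m, n):
    """Subsets reachable from {(1, 1)} when every pair (s, t) is a letter."""
    letters = universal_alphabet(m, n)
    init = frozenset({(1, 1)})
    seen, stack = {init}, [init]
    while stack:
        S = stack.pop()
        for s, t in letters:
            T = frozenset({(s[p - 1], q) for (p, q) in S}
                          | {(p, t[q - 1]) for (p, q) in S})
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return seen
```

For $m=n=2$ the alphabet has 16 letters and the search finds exactly the 10 valid subsets.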

Let $m'\leqslant m$ and $n'\leqslant n$. Then DFA ${\mathcal K}_{m',n'}=(Q_{m'},\varSigma _{m',n'}, \delta _{m'}, 1, \emptyset )$ (respectively, the DFA ${\mathcal L}_{m',n'}=(Q_{n'},\varSigma _{m',n'}, \delta _{n'}, 1, \emptyset )$) is a sub-DFA of ${\mathcal K}_{m,n}$ (respectively, of ${\mathcal L}_{m,n}$), in the sense that $Q_{m'} \subseteq Q_m$, $\varSigma _{m',n'} \subseteq \varSigma _{m,n}$, and $\delta _{m'} \subseteq \delta _m$. As well, NFA ${\mathcal N}_{m',n'}$ is a sub-NFA of ${\mathcal N}_{m,n}$. Note that ${\mathcal D}_{m,n}$ is extremal for the shuffle: every language $K \mathbin{\sqcup\!\sqcup} L$, where K and L are languages with state complexities m and n respectively, is recognized by some sub-DFA of ${\mathcal D}_{m,n}$ after possibly renaming some letters.

For the next lemma it is convenient to consider a subset S of states (p, q) of ${\mathcal N}_{m,n}$ as an $m\times n$ matrix, where the entry in row p and column q is (p, q) if $(p,q)\in S$, and it is empty otherwise. We first introduce the following notions.

Definition 7

Let $i,i'\in Q_m$, $i\ne i'$, and $j,j'\in Q_n$, $j\ne j'$.

(a) A row $i'$ contains row i, if $(i,j)\in S$ implies $(i',j) \in S$ for all $j \in Q_n$.

(b) A column $j'$ contains column j if $(i,j) \in S$ implies $(i,j') \in S$ for all $i \in Q_m$.

(c) A subset of $Q_m \times Q_n$ is valid if it satisfies Condition (C) from Lemma 2, that is, if it contains a state in row 1 and a state in column 1.

Lemma 8

Let S be a valid subset of $Q_m \times Q_n$ with the property that there are distinct $i,i'$ or $j, j'$ such that either row $i'$ contains row i or column $j'$ contains column j. Assume that every valid subset $S'$ of $Q_{m'} \times Q_{n'}$, where $m' < m$, or $n' < n$, or $|S'| < |S|$, is reachable in DFA ${\mathcal D}_{m',n'}$. Then S is reachable in ${\mathcal D}_{m,n}$.

Proof

If S contains an empty row or column, then without loss of generality we can renumber the n states of ${\mathcal L}_{m,n}$ in such a way that column n is the empty column in S. By the inductive assumption we know that S is reachable in ${\mathcal D}_{m,n-1}$ by some word w. Since ${\mathcal N}_{m,n-1}$ is a sub-NFA of ${\mathcal N}_{m,n}$, S is reachable in ${\mathcal D}_{m,n}$ as well by the same word. Suppose that S has neither an empty row nor an empty column. By symmetry, it is sufficient to consider the case with distinct i and $i'$ such that row $i'$ contains row i. Let $S' = S \setminus \{(i',j) \mid (i,j) \in S\text { for }j \in \{1,\ldots ,n\}\}$. Since $|S'| < |S|$, the set $S'$ is reachable by assumption. To obtain S, we apply the letter $a:(i \rightarrow i'); \mathbf {1}$, which copies row i into row $i'$ and leaves all other states unchanged. $\square $
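The last step of the proof can be traced on a concrete subset. In the sketch below a transformation is a dictionary that lists only the points it moves; the helper `apply_letter` and the sample subset are ours. Row 2 of the sample subset contains row 1, so $i=1$ and $i'=2$.

```python
def apply_letter(S, s, t):
    """Image of S under a letter inducing s on rows and t on columns.

    s and t are dicts listing only the moved points; all other points
    are fixed.
    """
    return ({(s.get(p, p), q) for (p, q) in S}
            | {(p, t.get(q, q)) for (p, q) in S})


S = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)}
i, i_prime = 1, 2

# Remove from row i' the entries that are also present in row i.
S_prime = S - {(i_prime, j) for (r, j) in S if r == i}

# Apply the letter that maps row i to row i' and fixes every column.
result = apply_letter(S_prime, {i: i_prime}, {})
```

Here `S_prime` is $\{(1,1),(1,2),(3,3)\}$ and `result` equals the original subset S.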

Lemma 9

Let S be a valid subset of $Q_m \times Q_n$ such that there is a column or a row with exactly one element. Assume that every valid subset $S'$ of $Q_{m'} \times Q_{n'}$, where $m' < m$, or $n' < n$, or $|S'| < |S|$, is reachable in ${\mathcal D}_{m', n'}$. Then S is reachable in ${\mathcal D}_{m,n}$.

Proof

Recall that we can assume $m \geqslant 2$ and $n \geqslant 2$. We may assume that there is neither an empty row nor an empty column in S; otherwise S is reachable by Lemma 8. It is sufficient to consider the case involving a column, since the case involving a row follows by symmetric arguments. Let (p, q) be the only element in column q. If there are more elements in row p, then column q is contained in another column and by Lemma 8, the set S is reachable.

Let $S'$ be the subset of $Q_{m-1} \times Q_{n-1}$ obtained by removing row p and column q, and renumbering the states to $Q_{m-1} \times Q_{n-1}$ in the way such that $i \in Q_m$ becomes $i-1$ if $i>p$ and otherwise remains the same, and $j \in Q_n$ becomes $j-1$ if $j>q$ and otherwise remains the same. We have that $S'$ is a valid subset, and by the inductive assumption it is reachable in ${\mathcal D}_{m-1,n-1}$ by some word $u'$; let u be the word corresponding to $u'$ in the original numbering of the states. We consider four cases.

Case $p \ne 1$ and $q \ne 1$: State $\{(1,1),(p,q)\}$ is reachable in ${\mathcal D}_{m,n}$ by word $a^2$, where $a:(1,p); (1,q)$. Then S is reachable by $a^2 u$.

Case $p = 1$ and $q \ne 1$: State $\{(2,1),(1,q)\}$ is reachable in ${\mathcal D}_{m,n}$ by the letter a, where $a:(1,2); (1,q)$. Then state (2, 1) corresponds to state (1, 1) after the renumbering, and S is reachable by $a u$.

Case $p \ne 1$ and $q = 1$: This is symmetrical to the previous case.

Case $p = 1$ and $q = 1$: State $\{(1,1),(2,2)\}$ is reachable in ${\mathcal D}_{m,n}$ by word $a^2$, where $a:(1,2); (1,2)$. Then state (2, 2) corresponds to state (1, 1) after the renumbering, and S is reachable by $a^2 u$. $\square $
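The two-element subsets used in these cases can be computed mechanically. The sketch below applies a letter that transposes 1 with a chosen row and 1 with a chosen column, once and twice, starting from $\{(1,1)\}$, so that both intermediate subsets can be inspected; the helper names are ours.

```python
def transposition(x, y):
    """The transformation swapping x and y, listing only the moved points."""
    return {x: y, y: x}


def apply_letter(S, s, t):
    """Image of S under a letter inducing s on rows and t on columns."""
    return ({(s.get(p, p), q) for (p, q) in S}
            | {(p, t.get(q, q)) for (p, q) in S})


def powers_of_a(row, col):
    """Subsets reached from {(1, 1)} by a and by a^2, where a: (1,row); (1,col)."""
    s, t = transposition(1, row), transposition(1, col)
    once = apply_letter({(1, 1)}, s, t)
    twice = apply_letter(once, s, t)
    return once, twice
```

With `row=3` and `col=4`, one application gives $\{(3,1),(1,4)\}$ and two applications give $\{(1,1),(3,4)\}$.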

Theorem 10

If for some h every valid subset can be reached in ${\mathcal D}_{h,\left( {\begin{array}{c}h\\ \lfloor h/2 \rfloor \end{array}}\right) }$ then for every $m \leqslant h$ and every n, every valid subset can be reached in ${\mathcal D}_{m,n}$.

Proof

This follows by induction on m, n, and |S|.

For $m=1$ this follows by induction on n: if $n=1$ then ${\mathcal D}_{1,1}$ consists of a single valid subset $\{(1,1)\}$, and if $n>1$, then we apply Lemma 8. For $m \leqslant h$ and $n \leqslant \left( {\begin{array}{c}h\\ \lfloor h/2 \rfloor \end{array}}\right) $ this holds by assumption, since ${\mathcal N}_{m,n}$ is a sub-NFA of ${\mathcal N}_{h,\left( {\begin{array}{c}h\\ \lfloor h/2 \rfloor \end{array}}\right) }$. If $|S|=1$, then $\{(1,1)\}$ is the only valid subset, and it is reachable since it is the initial subset of ${\mathcal D}_{m,n}$.

Let S be a valid subset of $Q_{m} \times Q_{n}$, where $m \leqslant h$ and $n > \left( {\begin{array}{c}h\\ \lfloor h/2 \rfloor \end{array}}\right) $, and assume that every valid subset $S'$ of $Q_{m'} \times Q_{n'}$ is reachable if $m' < m$, or $n' < n$, or $|S'| < |S|$. By Sperner’s theorem [5], the maximal number of subsets of an m-element set such that none of them contains any other subset is $\left( {\begin{array}{c}m\\ \lfloor m/2 \rfloor \end{array}}\right) $. This is not larger than $\left( {\begin{array}{c}h\\ \lfloor h/2 \rfloor \end{array}}\right) $; hence, there exist some columns $j,j'$ with $j\ne j'$ such that the j-th column is contained in $j'$-th column. By Lemma 8, the subset S is reachable. $\square $
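The Sperner bound used above can be confirmed by exhaustive search when m is small. The sketch below looks for a family of pairwise incomparable subsets of a given size; the function name `has_antichain` is ours, and the search is exponential, so it is meant only for $m\leqslant 4$.

```python
from itertools import combinations
from math import comb


def has_antichain(m, size):
    """True if {1, ..., m} has `size` subsets, none containing another."""
    subsets = [frozenset(c)
               for k in range(m + 1)
               for c in combinations(range(1, m + 1), k)]
    for family in combinations(subsets, size):
        # A < B tests proper inclusion of frozensets.
        if all(not (A < B or B < A) for A, B in combinations(family, 2)):
            return True
    return False
```

For $m=4$ the search finds six pairwise incomparable subsets but no seven, in agreement with $\binom{4}{2}=6$; this is why any valid subset with four rows and more than six distinct columns has a column contained in another.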

Corollary 11

Let $1\leqslant m\leqslant 4$ and $n\geqslant 1$. Then every valid subset can be reached in ${\mathcal D}_{m,n}$.

Proof

Since we have verified the reachability of all valid subsets for $m=4$ and $n=6=\left( {\begin{array}{c}4\\ 2\end{array}}\right) $, Theorem 10 applies with $h=4$. $\square $

To strengthen this result and show reachability for $m \leqslant 5$, we need to introduce another concept with permutations. Let $\varphi $ be any permutation of m rows. We split subsets of $Q_m$ (subsets of rows) into equivalence classes under $\varphi $. For $U \subseteq Q_m$, $[U]_\varphi = \{V \subseteq Q_m \mid V= \varphi ^i(U)\text { for some }i \geqslant 0\}$ denotes the equivalence class of U. See Tables 2, 3, 4 for examples of subsets whose columns U are partitioned into equivalence classes under some $\varphi $.

For a subset S of $Q_m \times Q_n$, by ${{\mathrm{col}}}(S,i)$ we denote the subset of $Q_m$ contained in the i-th column. Then ${{\mathrm{cols}}}(S) = \{{{\mathrm{col}}}(S,i) \mid 1 \leqslant i \leqslant n\}$ is the set of the subsets in the columns of S.

The following lemma assures reachability (under an inductive assumption) of a special kind of subsets whose columns form only full and empty equivalence classes under some permutation $\varphi $.

Lemma 12

Let $\varphi $ be a permutation of m rows. Let S be a valid subset of $Q_m~\times ~Q_n$ such that $[U]_\varphi \subseteq {{\mathrm{cols}}}(S)$ for every $U\in {{\mathrm{cols}}}(S)$, and there is a column $V\in {{\mathrm{cols}}}(S)$ such that $|[V]_\varphi |\geqslant 2$. Assume that every valid subset $S'$ of $Q_{m'} \times Q_{n'}$, where $m' < m$, or $n' < n$, or $|S'| < |S|$, is reachable in ${\mathcal D}_{m',n'}$. Then S is reachable in ${\mathcal D}_{m,n}$.

Proof

We can assume that no two columns contain the same subset of rows, no column is empty, and the first row contains at least two elements; otherwise S is reachable by Lemma 8 or by Lemma 9.

Let $S_j={{\mathrm{col}}}(S,j)$ be the j-th column of a valid subset S. Thus we have $S=\{(i,j)\mid 1\leqslant j \leqslant n \text { and } i\in S_j \}.$ Since $|[V]_\varphi |\geqslant 2$, we can always choose V so that $\varphi ^{-1}(V)$ is in a k-th column $S_k$ with $k\ne 1$. Let $S'$ be the set obtained from S by omitting the states in the k-th column and by taking the pre-image of $S_j$ under $\varphi $ in any other column, that is,

$$ S'=\{(i,j)\mid 1\leqslant j \leqslant n, j\ne k, \text { and } i\in \varphi ^{-1}(S_j) \}. $$

Since $k\ne 1$ and the first row of S contains at least two elements, the set $S'$ is valid. Since V is non-empty, we have $|S'| < |S|$. Let $\psi $ be a permutation that maps a column j to the column containing $\varphi ^{-1}(S_j)$, that is, we have $S_{\psi (j)}= \varphi ^{-1}(S_j)$. Let t be the transformation given by $a_{\varphi ,\psi }$. Let us show that $S' t = S$.

Let $(i,j)\in S'$. Then $i\in \varphi ^{-1}(S_j)$, so $\varphi (i)\in S_j$, and we have $(i,j)t = \{(\varphi (i),j), (i, \psi (j))\}\subseteq S$. Hence $S' t \subseteq S$.

Now let $(i,j)\in S$. First let $j\ne k$. Then $i\in S_j$, so $\varphi ^{-1}(i)\in \varphi ^{-1}(S_j)$. Therefore $(\varphi ^{-1}(i),j)\in ~S'$. Since $(i,j)\in (\varphi ^{-1}(i),j) t $, we have $(i,j)\in S' t$. Now let $j=k$. Then $i\in \varphi ^{-1}(V) $ and $S_{\psi ^{-1}(k)} = V$. Thus $(i, \psi ^{-1}(k))\in S'$, and we have $(i,k)\in (i, \psi ^{-1}(k)) t $. Hence $S\subseteq S't$. Our proof is complete. $\square $

Table 2. A subset and the equivalence classes of columns under $\varphi = [2,3,1,4,5]$.


Table 3. A subset and the equivalence classes of columns under $\varphi = [1,2,3,5,4]$.


Table 4. A subset and the equivalence classes of columns under $\varphi = [2,3,4,1,5]$.


Corollary 13

Let $1\leqslant m \leqslant 5$ and $n\geqslant 1$. Then every valid subset can be reached in ${\mathcal D}_{m,n}$.

Proof

The proof follows by analysis of valid subsets $S \subseteq Q_5 \times Q_n$, with the aid of Corollary 11, Lemmas 8 and 12, and the results from Table 1.

Suppose that there is a valid subset $S \subseteq Q_5 \times Q_n$ that is not reachable; let S be chosen so that n is the smallest number and S is a smallest non-reachable subset of $Q_5 \times Q_n$.

By Corollary 11 and the choice of n, every valid subset $S' \subset Q_{m'} \times Q_{n'}$, where $m' < 5$, or $n' < n$, or $|S'| < |S|$, is reachable. Hence, S has no column containing another column; otherwise, we can apply Lemma 8. Since we have verified the reachability of all valid subsets for $m=5$ and $n\leqslant 7$ (Table 1), we must have $n \geqslant 8$ and so S has at least 8 distinct columns. Obviously there is neither an empty nor a full column. If there is a column U with $|U|=1$ or $|U|=4$, then the other columns, being incomparable with U, correspond to pairwise incomparable subsets of a 4-element set; by Sperner’s theorem there are at most $\left( {\begin{array}{c}4\\ 2\end{array}}\right) = 6$ of them, so $n\leqslant 7$, which contradicts $n\geqslant 8$. Hence S can have only columns U with $|U|=3$ or $|U|=2$.

Let $C_3$ be the number of 3-element columns ($|U|=3$), and $C_2$ be the number of 2-element columns ($|U|=2$). We are searching for possible subsets S that do not have a column containing another column, and with $C_3+C_2 \geqslant 8$. We consider the following six cases.

(1) Let $C_3=0$. If $C_2 = 10$, which implies that S contains all possible 2-element subsets, then under $\varphi = [2,3,4,5,1]$ we have two full and non-trivial equivalence classes. Hence S is reachable from a smaller subset by Lemma 12. If $C_2 = 9$, then without loss of generality let the missing 2-element subset be $\{4,5\}$; see Table 2. Under $\varphi = [2,3,1,4,5]$ we have three full and non-trivial equivalence classes, and S is reachable by Lemma 12. Finally, if $C_2 = 8$, then we have two subcases. If the two missing 2-element subsets are disjoint, then without loss of generality let them be $\{2,3\}$ and $\{4,5\}$. Under $\varphi = [1,4,5,2,3]$ we have five full equivalence classes and three of them are non-trivial, so S is reachable by Lemma 12. If they have a common element, then without loss of generality let them be $\{3,4\}$ and $\{4,5\}$. Under $\varphi = [1,2,5,4,3]$ we have six full equivalence classes and two of them are non-trivial. Thus S is reachable by Lemma 12.

(2) Let $C_3=1$. The only possible subset, up to permutation of columns and rows, is shown in Table 3. It has all columns with two elements that are not contained in the 3-element column. By Lemma 12 with $\varphi = [1,2,3,5,4]$, it is reachable.

(3) Let $C_3=2$. A simple analysis reveals that if the 3-element columns have only one common element, then $C_2$ is at most 4. If they have two common elements, then $C_2$ is at most 5. Thus in this case, we have $C_2+C_3\leqslant 7$.

(4) Let $C_3=3$. Here $C_2$ is at most 4.

(5) Let $C_3=4$. The only possible subset, up to permutation of columns and rows, is shown in Table 4. By Lemma 12 with $\varphi = [2,3,4,1,5]$, it is reachable.

(6) Let $C_3\geqslant 5$. These cases are symmetrical to those with $C_3 \leqslant 3$; it is sufficient to consider the complement of S.

Since these cover all the possibilities for set S, this set is reachable. $\square $
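The equivalence classes used in case (1) can be recomputed mechanically. The sketch below partitions a set of columns into classes under a permutation $\varphi $ given in the notation $[1\varphi ,2\varphi ,\ldots ]$, and reports failure when some class is only partly present; the function name `classes_under` is ours.

```python
from itertools import combinations


def classes_under(phi, columns):
    """Partition `columns` (a set of frozensets) into classes under phi.

    Returns a list of classes, or None if some class is only partly
    contained in `columns`.
    """
    def image(U):
        return frozenset(phi[i - 1] for i in U)

    remaining, classes = set(columns), []
    while remaining:
        U = next(iter(remaining))
        orbit, V = {U}, image(U)
        while V != U:
            orbit.add(V)
            V = image(V)
        if not orbit <= columns:
            return None
        classes.append(orbit)
        remaining -= orbit
    return classes


pairs = {frozenset(c) for c in combinations(range(1, 6), 2)}
```

For example, `classes_under([2, 3, 4, 5, 1], pairs)` returns two classes of five columns each, while removing a single pair from `pairs` makes the same call return `None`.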

2.2 Proof of Distinguishability

The aim of this section is to show that there are regular languages defined over a three-letter alphabet such that the subset automaton of the NFA for their shuffle does not have equivalent states.

To this aim let ${\mathcal A}= (Q,\varSigma ,\delta ,s,F)$ be an NFA. We say that a state q in Q is uniquely distinguishable if there is a word w in $\varSigma ^*$ which is accepted by ${\mathcal A}$ from and only from the state q, that is, if there is a word w such that $\delta (p,w)\cap F\ne \emptyset $ if and only if $p=q$. First, let us prove the following two observations.

Proposition 14

If each state of an NFA ${\mathcal A}$ is uniquely distinguishable, then the subset automaton of ${\mathcal A}$ does not have equivalent states.

Proof

Let S and T be two distinct subsets in $2^Q$. Then, without loss of generality, there is a state q in Q with $q\in S \setminus T$. Since q is uniquely distinguishable, there is a word w which is accepted by ${\mathcal A}$ from and only from q. Therefore, the subset automaton of ${\mathcal A}$ accepts w from S and it rejects w from T. Hence w distinguishes S and T. $\square $

Proposition 15

Let a state q of an NFA ${\mathcal A}= (Q,\varSigma ,\delta ,s,F)$ be uniquely distinguishable. Assume that there is a symbol a in $\varSigma $ and exactly one state p in Q that goes to q on a, that is, (p, a, q) is a unique in-transition on a going to q. Then the state p is uniquely distinguishable as well.

Proof

Let w be a word which is accepted by ${\mathcal A}$ from and only from q. The word aw is accepted from p since $q \in \delta (p,a)$ and w is accepted from q. Let $r\ne p$. Then $q\notin \delta (r,a)$ since (p, a, q) is a unique in-transition on a going to q. It follows that the word w is not accepted from any state in $\delta (r,a)$. Thus ${\mathcal A}$ rejects aw from r, so p is uniquely distinguishable. $\square $

Now we can prove the following result.

Theorem 16

Let $m,n\geqslant 2$. There exist ternary languages K and L with $\kappa (K)=m$ and $\kappa (L)=n$ such that the subset automaton of the NFA accepting $K \mathbin{\sqcup\!\sqcup} L$ does not have equivalent states.

Proof

Let m and n be arbitrary but fixed integers with $m,n\geqslant 2$. Let K be accepted by the DFA ${\mathcal K}=(\{1,2,\ldots ,m\},\{a,b,c\},\delta _K,1, \{m\})$, where for each i in $\{1,2,\ldots ,m\}$,

$\delta _K(i,a)=i+1$ if $i\leqslant m-1$ and $\delta _K(m,a)=1$;

$\delta _K(i,b)=1$;

$\delta _K(1,c)=2$ and $\delta _K(i,c)=1$ if $i\geqslant 2$.

Let L be accepted by the DFA ${\mathcal L}=(\{1,2,\ldots ,n\},\{a,b,c\},\delta _L,1, \{n\})$, where for each j in $\{1,2,\ldots ,n\}$,

$\delta _L(j,a)=1$;

$\delta _L(j,b)=j+1$ if $j\leqslant n-1$ and $\delta _L(n,b)=1$;

$\delta _L(j,c)=n$.

The DFAs $ {\mathcal K}$ and ${\mathcal L}$ are shown in Fig. 2.
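The two transition tables can be transcribed directly, and the claims $\kappa (K)=m$ and $\kappa (L)=n$ can then be sanity-checked for concrete values. The sketch below is an illustration only: the function names are hypothetical, and minimality is tested with a plain Moore-style partition refinement rather than any procedure taken from the paper.

```python
ALPHABET = "abc"


def dfa_K(m):
    """Transition table of the DFA for K on states 1..m (accepting state m)."""
    delta = {}
    for i in range(1, m + 1):
        delta[i, "a"] = i + 1 if i <= m - 1 else 1
        delta[i, "b"] = 1
        delta[i, "c"] = 2 if i == 1 else 1
    return delta


def dfa_L(n):
    """Transition table of the DFA for L on states 1..n (accepting state n)."""
    delta = {}
    for j in range(1, n + 1):
        delta[j, "a"] = 1
        delta[j, "b"] = j + 1 if j <= n - 1 else 1
        delta[j, "c"] = n
    return delta


def state_complexity(delta, final, initial=1):
    """Number of pairwise inequivalent states reachable from `initial`."""
    reachable, stack = {initial}, [initial]
    while stack:
        s = stack.pop()
        for x in ALPHABET:
            t = delta[s, x]
            if t not in reachable:
                reachable.add(t)
                stack.append(t)
    # Moore refinement: start from {accepting, rejecting} and split blocks
    # until the number of blocks stops growing.
    block = {s: int(s == final) for s in reachable}
    while True:
        signature = {s: (block[s],) + tuple(block[delta[s, x]] for x in ALPHABET)
                     for s in reachable}
        ids = {sig: k for k, sig in enumerate(sorted(set(signature.values())))}
        refined = {s: ids[signature[s]] for s in reachable}
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = refined
```

For example, `state_complexity(dfa_K(4), 4)` and `state_complexity(dfa_L(5), 5)` give the sizes of the minimal DFAs for the case $m=4$, $n=5$ used in the figures.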

Construct the NFA ${\mathcal N}$ for the shuffle of K and L as described in Sect. 1. The transitions on a, b, c in ${\mathcal N}$ for $m=4$ and $n=5$ are shown in Fig. 3. Notice that each state (i, j) with $2\leqslant i \leqslant m$ and $2 \leqslant j \leqslant n$ has a unique in-transition on symbol a and this transition goes from state $(i-1,j)$; see the dashed transitions in Fig. 3 (top-left). Next, each state (m, j) with $2\leqslant j\leqslant n$ has a unique in-transition on b which goes from $(m,j-1)$, and each state (i, 2) with $2\leqslant i\leqslant m$ has a unique in-transition on b going from (i, 1); see the dashed transitions in Fig. 3 (top-right). Finally, the state (2, 1) has a unique in-transition on c going from (1, 1); see the dashed transition in Fig. 3 (bottom).
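The unique in-transitions listed above can be confirmed mechanically. The following sketch assumes that the NFA is the usual product construction for the shuffle, in which state (i, j) moves on a symbol x to $(\delta _K(i,x), j)$ and to $(i, \delta _L(j,x))$; all function names are hypothetical. It collects every in-transition that is unique for its target and symbol, and compares that set with the transitions named in the text.

```python
from collections import defaultdict


def step_K(i, x, m):
    """Transition function of the DFA for K."""
    if x == "a":
        return i % m + 1
    if x == "b":
        return 1
    return 2 if i == 1 else 1


def step_L(j, x, n):
    """Transition function of the DFA for L."""
    if x == "a":
        return 1
    if x == "b":
        return j % n + 1
    return n


def shuffle_nfa(m, n):
    """Assumed product NFA: each symbol advances exactly one component."""
    return {((i, j), x): {(step_K(i, x, m), j), (i, step_L(j, x, n))}
            for i in range(1, m + 1) for j in range(1, n + 1) for x in "abc"}


def unique_in_transitions(delta):
    """All triples (p, x, q) such that p is the only x-predecessor of q."""
    sources = defaultdict(list)
    for (p, x), targets in delta.items():
        for q in targets:
            sources[q, x].append(p)
    return {(ps[0], x, q) for (q, x), ps in sources.items() if len(ps) == 1}


def claimed_transitions(m, n):
    """The unique in-transitions named in the text."""
    edges = {((i - 1, j), "a", (i, j))
             for i in range(2, m + 1) for j in range(2, n + 1)}
    edges |= {((m, j - 1), "b", (m, j)) for j in range(2, n + 1)}
    edges |= {((i, 1), "b", (i, 2)) for i in range(2, m + 1)}
    edges.add(((1, 1), "c", (2, 1)))
    return edges
```

For $m=4$ and $n=5$, `claimed_transitions(4, 5)` is a subset of `unique_in_transitions(shuffle_nfa(4, 5))`. The computed set can also contain unique in-transitions that the proof does not need.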

The empty word is accepted by ${\mathcal N}$ from and only from the state (m, n) since this is a unique accepting state of ${\mathcal N}$. Thus (m, n) is uniquely distinguishable. Next, consider the subgraph of unique in-transitions in ${\mathcal N}$. Figure 4 shows this subgraph in the case of $m=4$ and $n=5$. Notice that from each state of ${\mathcal N}$, the state (m, n) is reachable in this subgraph. By Proposition 15, used repeatedly, we get that each state of ${\mathcal N}$ is uniquely distinguishable. Hence by Proposition 14, the subset automaton of ${\mathcal N}$ does not have equivalent states. $\square $
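The repeated use of Proposition 15 amounts to a backward search from (m, n) along unique in-transitions, prepending one symbol at each step. The sketch below illustrates this under the same assumption as before, namely that the NFA is the standard product construction in which each symbol advances exactly one component; the function names are hypothetical. It returns a witness word for every state it reaches, and a helper confirms that each word is accepted from exactly one state.

```python
from collections import deque


def shuffle_nfa(m, n):
    """Assumed product NFA for the shuffle of K and L (states are pairs)."""
    def step_K(i, x):
        return i % m + 1 if x == "a" else 1 if x == "b" else (2 if i == 1 else 1)

    def step_L(j, x):
        return 1 if x == "a" else j % n + 1 if x == "b" else n

    return {((i, j), x): {(step_K(i, x), j), (i, step_L(j, x))}
            for i in range(1, m + 1) for j in range(1, n + 1) for x in "abc"}


def witness_words(m, n):
    """Map each state found by the backward search to a witness word."""
    delta = shuffle_nfa(m, n)
    sources = {}
    for (p, x), targets in delta.items():
        for q in targets:
            sources.setdefault((q, x), []).append(p)
    witness = {(m, n): ""}  # the empty word is accepted only from (m, n)
    queue = deque([(m, n)])
    while queue:
        q = queue.popleft()
        for x in "abc":
            ps = sources.get((q, x), [])
            # Proposition 15: a unique x-predecessor inherits the witness x + w.
            if len(ps) == 1 and ps[0] not in witness:
                witness[ps[0]] = x + witness[q]
                queue.append(ps[0])
    return delta, witness


def accepting_sources(delta, word, final):
    """List the states from which the NFA accepts `word`."""
    result = []
    for start in sorted({p for (p, _) in delta}):
        current = {start}
        for x in word:
            current = set().union(*(delta[p, x] for p in current))
        if final in current:
            result.append(start)
    return result
```

For $m=4$ and $n=5$ the search assigns a witness to all 20 states, and `accepting_sources(delta, w, (4, 5))` returns a single state for each witness w. This is a check for particular values only, not a substitute for the proof.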

3 Conclusions

We have examined the state complexity of the shuffle operation on two regular languages of state complexities m and n, respectively, and found an upper bound for it. We know that this bound can be reached for any m with $1 \leqslant m \leqslant 5$ and any $n \geqslant 1$, and also for $m=n=6$. For the remaining values of m and n, however, the problem remains open. Since there exist two languages K and L for which all pairs of states in the subset automaton of the NFA accepting the shuffle are distinguishable, the main difficulty consists of proving that all valid states in the subset automaton can be reached for the witness languages.

References

1. Brzozowski, J., Jirásková, G., Li, B.: Quotient complexity of ideal languages. Theoret. Comput. Sci. 470, 36–52 (2013)
2. Câmpeanu, C., Salomaa, K., Yu, S.: Tight lower bound for the state complexity of shuffle of regular languages. J. Autom. Lang. Comb. 7(3), 303–310 (2002)
3. The GAP Group: GAP – Groups, Algorithms, and Programming, Version 4.8.3 (2016). http://www.gap-system.org
4. Okhotin, A.: On the state complexity of scattered substrings and superstrings. Fund. Inform. 99(3), 325–338 (2010)
5. Sperner, E.: Ein Satz über Untermengen einer endlichen Menge. Math. Z. 27, 544–548 (1928)
6. Yu, S.: State complexity of regular languages. J. Autom. Lang. Comb. 6, 221–234 (2001)

Acknowledgments

We would like to thank an anonymous referee for proposing the notions of a uniquely distinguishable state and of a subgraph of unique in-transitions which allow us to simplify the proof of distinguishability. We are also grateful for his comments and suggestions that helped us improve the presentation of the paper.

Author information

Authors and Affiliations

David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
Janusz Brzozowski, Bo Liu & Aayush Rajasekaran
Mathematical Institute, Slovak Academy of Sciences, Grešákova 6, 040 01, Košice, Slovakia
Galina Jirásková
Institute of Computer Science, University of Wrocław, Joliot-Curie 15, 50-383, Wrocław, Poland
Marek Szykuła

Corresponding author

Correspondence to Marek Szykuła.

Editor information

Editors and Affiliations

University of Prince Edward Island, Charlottetown, Canada
Cezar Câmpeanu
Universität Kiel, Kiel, Germany
Florin Manea
School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
Jeffrey Shallit

About this paper

Cite this paper

Brzozowski, J., Jirásková, G., Liu, B., Rajasekaran, A., Szykuła, M. (2016). On the State Complexity of the Shuffle of Regular Languages. In: Câmpeanu, C., Manea, F., Shallit, J. (eds) Descriptional Complexity of Formal Systems. DCFS 2016. Lecture Notes in Computer Science, vol 9777. Springer, Cham. https://doi.org/10.1007/978-3-319-41114-9_6

DOI: https://doi.org/10.1007/978-3-319-41114-9_6
Published: 28 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41113-2
Online ISBN: 978-3-319-41114-9
eBook Packages: Computer Science, Computer Science (R0)
