Keywords

1 Introduction

As a central part of Substitution-Permutation Networks, diffusion layers are very important for the overall security and efficiency of cryptographic schemes. On the one hand, they play a role in spreading internal dependencies, which contributes to enhancing the resistance of statistical cryptanalysis. On the other hand, with the rapid development of lightweight cryptography, designing hardware-efficient diffusion layers has already been a hot research topic due to the increasing importance of ubiquitous computing.

The quality of a diffusion layer is connected to its branch number, whose cryptographic significance corresponds to the minimal number of active S-boxes in any two consecutive rounds. Obviously, the larger the branch number is, the better the diffusion effect will be, and simultaneously the cipher will not be vulnerable to unexpected attacks. Therefore, most designers chose to focus on diffusion layers with the best possible branch number to ensure a relatively strongest security.

From a coding theory perspective, Maximum Distance Separable (MDS) codes are quite good choices for the construction of diffusion layers since their branch numbers are maximum (known as the Singleton bound [1]). Not only are MDS matrices used in many block ciphers [24], but they promote generations of various related design strategies [57]. However, a problem with using MDS matrices is that they usually come at the price of a less efficient implementation. Due to Galois field multiplications, hardware implementations will often suffer from an important area requirement, with the result that MDS matrices are not suitable for the resource-constrained environments, such as RFID systems and sensor networks. Although this unfavorable situation is greatly improved with the advent of recursive MDS matrices [810], the temporary memory required (and hence hardware area) for the computation of matrices is still not reduced to a degree of satisfaction sometimes.

Another attractive type of diffusion layers is derived from Maximum Distance Binary Linear (MDBL) codes. The corresponding binary matrices are optimal in the sense that they achieve the largest possible branch number. Though the diffusion speed of optimal binary matrices can not keep pace with the one of MDS matrices, it is an overwhelming advantage that they involve no finite field multiplication, which is more propitious to a low-cost implementation. Typical examples are block ciphers E2 [11], Camellia [12] and ARIA [13], who get an excellent hardware efficiency and remarkable software performance on various platforms as well. It is accordingly our belief that, in many cases, it is easier to obtain an overall construction through using optimal binary matrix (or in general a matrix with branch number not meeting the Singleton bound), despite sacrificing the diffusion speed to a certain extent.

Compared with the study on constructions of MDS matrices, the research on designs of MDBL matrices is rather limited [14, 15]. Early strategy from [16] (partially) guided the design of diffusion layers in E2 and Camellia, and unified method presented in [17] was conducive to summarizing the characteristics of \(8 \times 8\) optimal binary matrices. For constructions of large dimensions (e.g. \(16 \times 16\) and \(32 \times 32\)), designers in [18, 19] considered combining small matrices into bigger ones, where each block matrix corresponds a finite field element. Indeed, in our opinion, the generalities of most previous constructions (focusing only on a few dimensions) are very weak, not to mention making them have efficient implementation. Here, one exception is the proposal of Dehnavi et al. [20], who recently investigated a special kind of binary linear layers for commonly used sizes with efficient implementation.

Feistel structure is one of the most prominently used structures in cryptography and accounts for substantial portion of data encrypted today. This was facilitated by the introduction of DES [21], which indicated the generation of modern block cipher. Not only is this classical structure used in large quantities of symmetric-key algorithms, but it inspires plenty of designs of cryptographic primitives. For example, S-boxes in [2224] are constructed by 3-round Feistel structure, and linear layers in E2 and Camellia are implicitly implemented with 4-round Feistel structure.

Our Contributions. In this paper, we propose constructing diffusion layers over \(\mathbb {F}_2\) with maximal branch number and efficient hardware implementation by use of Feistel structure with bit permutation as round functions. After introducing necessary notations and concepts in Sect. 2, we investigate the bounds on the branch number this construction can achieve, which will later help us judge whether hardware efficiency (hereby focus mainly on the area and latency) of the resulting matrix is optimal. Meanwhile, taking account of the improvement of software performance, we restrict the round function to cyclic shift and give the overall search strategy for “optimal solutions” in Sect. 3.

In order to demonstrate the generality of our construction, we provide typical optimal solutions for a series of feasible parameters (up to 32) in Sect. 4. To the best of our knowledge, the hardware cost of most proposals is the lowest compared with previous results. For cryptographic applications, our focus is further placed on involutory optimal solutions with commonly used dimensions in Sect. 5, and we prove that it is impossible to obtain an involutory \(8 \times 8\) optimal diffusion layer from this structure.

Along similar lines, we present diffusion layers constructed by Generalized Feistel Structure in Sect. 6, improving their applicabilities on other platforms without loss of hardware efficiency. According to figures listed in Sect. 7, we afterwards show that optimal solutions for size \(16 \times 16\) and \(32 \times 32\) can save about \(20\,\%\) and \(33.3\,\%\) gate equivalents respectively, compared to the best known results. Finally, we conclude the paper in Sect. 8.

2 Preliminaries

In this section, we fix the basic notions and further more introduce several judgement methods of branch number. Since diffusion layers investigated in the present paper are linear transformations on the n-dimensional vector space over \(\mathbb {F}_{2}\), we directly use an \(n \times n\) binary matrix to represent a linear layer in the subsequent discussions.

2.1 Branch Number

Assume \(\mathbf{{v}} = ({v_1},{v_2}, \ldots ,{v_n})^T\) is a vector such that \({v_i} \in \mathbb {F}_{2}\), \(1 \le i \le n\). Then the Hamming weight of \(\mathbf{{v}}\), denoted by \({w_b}(\mathbf{{v}})\), is equal to the number of non-zero elements in \(\mathbf {v}\).

Definition 1

[3] The differential branch number of a diffusion layer D is given by

$$\begin{aligned} {\mathcal {B}_d}(D) = \mathop {\min }\limits _{\mathbf{{v}} \ne 0} \{ {w_b}(\mathbf{{v}}) + {w_b}(D(\mathbf{{v}}))\}. \end{aligned}$$
(1)

Analogously, we can define the linear branch number.

Definition 2

[3] The linear branch number of a diffusion layer D is given by

$$\begin{aligned} {\mathcal {B}_l}(D) = \mathop {\min }\limits _{\mathbf{{v}} \ne 0} \{ {w_b}(\mathbf{{v}}) + {w_b}(D^T(\mathbf{{v}}))\}, \end{aligned}$$
(2)

where \({D^T}\) is the transposition of D.

As the differential branch number of an \(n \times n\) linear transformation is equal to the minimum distance of its associated [2nn] linear code, the maximal \(\mathcal {B}_d\) of a binary matrix is known for small dimension according to [25]. A binary matrix is optimal if it achieves the maximal \(\mathcal {B}_d\) and \(\mathcal {B}_l\). Since each \(n \times n\) (with the exception of \(n = 32\)) diffusion layer over \(\mathbb {F}_{2}\) constructed in this article satisfies \(\mathcal {B}_d = \mathcal {B}_l\), we omit linear branch number in the sequel.

Definition 3

[14] Two matrices A, B are permutation homomorphic to each other if there exists a row permutation \(\rho \) and a column permutation \(\gamma \) satisfying

$$\begin{aligned} \rho (\gamma (A)) = \gamma (\rho (A)) = B. \end{aligned}$$
(3)

Proposition 1

[14] If two matrices A, B are permutation homomorphic to each other, then A, B are of the same branch number.

2.2 Judgement Methods

There is a one-to-one correspondence between an \(n \times n\) linear transformation \(\theta (x) = M \cdot x\) and a linear code \(\mathcal {C}_\theta \) with the generator matrix \({G_\theta } = [{I_{n \times n}}|{M}]\), so we can use the following property to determine \(\mathcal {B}_d(M)\).

Proposition 2

[3] A linear code has minimum distance d if and only if every \(d - 1\) columns of its parity check matrix are linearly independent and there exists some set of d columns that are linearly dependent.

To deal with an \(n \times n\) binary matrix with branch number s, it costs approximately \(\sum _{i=1}^{s}{2n\atopwithdelims ()i}\) Gaussian eliminations according to Proposition 2, while it needs to exhaust all possible non-zero input vectors based on Definition 1. After analyzing the characteristics of minimum-weight codewords among \(\mathcal {C}_\theta \), we give Algorithm 2 (cf. Appendix A for more details) as main detection method, reducing the time complexity to \(2\sum _{i=1}^{\left\lfloor {s/2} \right\rfloor }{n\atopwithdelims ()i}\).

3 On Properties of Proposed Diffusion Layers

3.1 The General Construction

Throughout this paper, we consider diffusion layers over \(\mathbb {F}_2\) constructed by Feistel structure with bit permutation as round functions. Let \(x = ({x_L},{x_R})\) and \(y = ({y_L},{y_R})\) be the n-bit input and output respectively, then a diffusion layer shown in Fig. 1 can be characterized as

$$\begin{aligned} M = \left( {\begin{array}{*{20}{c}} 0 &{} I \\ I &{} 0 \\ \end{array}} \right) \left( {\begin{array}{*{20}{c}} {{P_r}} &{} I \\ I &{} 0 \\ \end{array}} \right) \cdots \left( {\begin{array}{*{20}{c}} {{P_1}} &{} I \\ I &{} 0 \\ \end{array}} \right) , \end{aligned}$$
(4)

where the size of each block matrix is \(\frac{n}{2} \times \frac{n}{2}\). In the rest of this paper, we only extract the sequence of permutation matrices, namely, \([{P_1},{P_2}, \ldots ,{P_r}]\) to represent M for simplicity. According to [21], it holds that

$$\begin{aligned} {M^{ - 1}} = [{P_r},{P_{r - 1}}, \ldots ,{P_1}], \end{aligned}$$
(5)

which means the inverse matrix could be implemented with the same structure by simply reversing the order of round functions. This decided advantage guarantees the diffusion layer and its inverse require equal XOR’s in terms of hardware implementation. In particular, the encryption and decryption can even use the exact same circuit in the case of involutory instances, i.e. round transformations appearing symmetrically (e.g. \([{P_1},{P_2},{P_1}]\) and \([{P_1},{P_2},{P_2},{P_1}]\)).

Fig. 1.
figure 1figure 1

A diffusion layer over \(\mathbb {F}_2\) constructed with Feistel structure

As hardware efficiency can have very different meanings depending on the utilization scenario targeted by the designer, we hereby chose to focus on two classical metrics: silicon area and latency. Clearly, to make the perfect diffusion layer hardware-optimal under this construction, the number of iterations should be as small as possible on the premise of maximal branch number. Therefore it is necessary to investigate the bound on the branch number of resulting diffusion layer.

3.2 Upper Bound of the Branch Number

Before elaborating our main theorem, we need to introduce the following novel observation. Remember that in technical terms, the Fibonacci sequence F(n) is defined by the recurrence relation \(F(n + 2) = F(n + 1) + F(n)\), with seed values \(F(0) = 1\), \(F(1) = 1\).Footnote 1

Lemma 1

Assume \((0,\alpha )\) is the input of the structure shown in Fig. 1 such that \({w_b}(\alpha ) = 1\). Then the Hamming weight of the output of round i, \({w_b}({y_i})\), is upper bounded by the i-th number of the Fibonacci sequence:

$$\begin{aligned} {w_b}({y_i}) \le F(i). \end{aligned}$$
(6)
Fig. 2.
figure 2figure 2

Propagation of the Hamming weight on the input \((0,\alpha )\)

Proof

Let us illustrate it by mathematical induction. According to the output of the first three rounds (see Fig. 2), we know that \({w_b}({y_1}) = F(1)\), \({w_b}({y_2}) = F(2)\) and \({w_b}({y_3}) \le F(3)\).

Now suppose the induction hypothesis is true for round i, \(3 \le i < r\). Then we only need to prove \({w_b}({y_{i+1}}) \le F(i+1)\). Notice that the transformation of round \(i+1\) can be represented as

$$\begin{aligned} \left\{ \begin{array}{l} {y_{i+1,L}} = {P_{i+1}}({y_{i,L}}) \oplus {y_{i,R}} \\ {y_{i+1,R}} = {y_{i,L}} \\ \end{array} \right. \end{aligned}$$

and we always have \({w_b}({y_{i,L}}) = {w_b}({P_{i + 1}}({y_{i,L}}))\), which implies \({w_b}({y_{i+1,L}}) \le {w_b}({y_{i,L}}) + {w_b}({y_{i,R}})\).

Likewise, we obtain

$$\begin{aligned} {w_b}({y_{i+1,R}}) = {w_b}({y_{i,L}}) \le {w_b}({y_{i-1,L}}) + {w_b}({y_{i-1,R}}). \end{aligned}$$

Thus it holds

$$\begin{aligned} {w_b}({y_{i+1}})= & {} {w_b}({y_{i+1,L}}) + {w_b}({y_{i+1,R}}) \\\le & {} {w_b}({y_{i,L}}) + {w_b}({y_{i,R}}) + {w_b}({y_{i - 1,L}}) + {w_b}({y_{i - 1,R}}) \\\le & {} F(i) + F(i - 1) \\= & {} F(i+1), \end{aligned}$$

and we complete the proof.    \(\square \)

Theorem 1

The branch number of the diffusion layer constructed as in Fig. 1 satisfies

$$\begin{aligned} \mathcal {B}_d^{(r)} \le {\left\{ \begin{array}{ll} 2F\left( {\frac{{r + 1}}{2}} \right) &{} r\text { is odd} \\ F\left( {\frac{r}{2}} \right) + F\left( {\frac{r}{2} + 1} \right) &{} r\text { is even}, \end{array}\right. } \end{aligned}$$
(7)

where the superscript is used to emphasize the number of rounds in the proposed construction.

Proof

Our strategy is similar to the start-from-the-middle technique [26]: start with a particular state value at the middle round and then propagate forward and backward to the output and input of the Feistel structure respectively. Note that middle round means different positions depending on whether the number of rounds is odd or even.

Fig. 3.
figure 3figure 3

Upper bound of the branch number of proposed diffusion layers

When r is odd, let \((0,\alpha )\) be the input of round \((r + 1)/2\) such that \({w_b}(\alpha ) = 1\). For \(r = 1\), it is easy to see that \({w_b}({y_0}) + {w_b}({y_1}) = 2F(1)\). For \(r \ge 3\), both the forward and backward propagations begin with the same initial valueFootnote 2 \((\alpha ,0)\) and contain \((r + 1)/2-1\) rounds (see Fig. 3.(a)). According to Lemma 1, we obtain one input/output pair whose Hamming weight satisfies

$$\begin{aligned} {w_b}({y_0}) \le F\left( {\frac{{r + 1}}{2}} \right) , \quad {w_b}({y_r}) \le F\left( {\frac{{r + 1}}{2}} \right) , \end{aligned}$$

which implies the branch number of the resulting diffusion layer is at most \(2F((r + 1)/2)\).

When r is even, we change the target position to round \(r/2 + 1\), with the result that each direction consists of r / 2 rounds (see Fig. 3.(b)). Likewise it holds

$$\begin{aligned} {w_b}({y_0}) + {w_b}({y_r}) \le F\left( {\frac{{r}}{2} + 1} \right) + F\left( {\frac{{r}}{2}} \right) , \end{aligned}$$

since the inputs of the backward and forward direction are \((\alpha ,0)\) and \((0,\alpha )\) respectively. Hence we complete the proof.    \(\square \)

Remark 1

For an expected branch number, Theorem 1 gives insights on the lower bound on the number of rounds our construction should have, efficiently reducing a lot of unnecessary search works. As an illustration, we need at least 8 iterations in the Feistel structure to get a diffusion layer with \({\mathcal {B}_d} = 12\) due to \(\mathcal {B}_d^{(7)} \le 2F(4) = 10\).

3.3 Search Strategy for Software-Friendly Diffusion Layers

In this section, we will explain how to improve software performances of the proposed diffusion layers without loss of hardware efficiency. Compared with the bit permutation, cyclic shift is undoubtedly much more attractive as suitable rotation can be implemented as a single instruction on the corresponding processor. For example, while constructing a \(16 \times 16\) binary matrix with cyclic shift as round transformations, all operations of each round are based on 32-bit words on condition that 4-bit S-boxes are used. As a result, instead of bit permutation, cyclic shift is our first choice and the round function is afterwards restricted to \({P_i}(x) = x < < < {t_i}\), \(0 \le {t_i} < n/2\).

The pseudo-code of our basic search procedure is shown in Algorithm 1. The function BasicSearch (n, r, T, \(\mathbb {G}\)) returns all \(n \times n\) binary matrices with \({\mathcal {B}_d} \ge T\) constructed by r-round Feistel structure. Here \(\mathbb {G}\) denotes the set of transformation matrices that can be selected as round functions. On the basis of the above strategy, we initialize it to the set of matrices representing cyclic shift (which implies \(\left| \mathbb {G} \right| = n/2\)) and begin the first attempt with minimum possible r according to Theorem 1. If no optimal solution is found (i.e. \(\mathbb {E} = \emptyset \)), choose to increase r or relax restrictions on some round functions to continue searching, until suboptimal solutions are returned.

figure afigure a

Remark 2

“Optimal solutions” here refer to binary matrices with maximal \(\mathcal {B}_d\) constructed by the least possible number of cyclic shift operations. For instance, an \(8 \times 8\) matrix with \({\mathcal {B}_d} = 5\) constructed by 4 cyclic shifts is optimal solution owing to \(\mathcal {B}_d^{(3)} \le 4\). Moreover, “suboptimal solutions” have different forms since bigger r results in higher cost in hardware implementation, while enlarged \(\mathbb {G}\) (from the set of cyclic shift matrices to the one of permutation matrices) leads to the loss of advantages in software performance. Consequently, there are various trade-offs when we search for suboptimal solutionsFootnote 3.

Next, we introduce the following statement to relate certain matrices that lead to the same branch number.

Theorem 2

For any diffusion layer \(M = [{P_1},{P_2}, \ldots ,{P_r}]\) constructed by r-round Feistel structure, there always exists a corresponding \(M' = [I,{{P_2}'}, \ldots ,{{P_r}'}]\) such that \({\mathcal {B}_d}(M') = {\mathcal {B}_d}(M)\).

Proof

First of all, it is not difficult to see that we can place \(P_1\) after the XOR operation in round 1 as shown in Fig. 4.(b), since

$$\begin{aligned} {y_{1,L}} = {P_1}({x_L}) \oplus {x_R} = {P_1}({x_L} \oplus P_1^{ - 1}({x_R})). \end{aligned}$$

By using similar equivalent transforms, \(P_1\) can be moved to the end of the structure (see Fig. 4.(c) and (d)), with each round function redefined as \({{P_i}'} = {P_i} \cdot {P_1}\) (i is even) or \({{P_i}'} = P_1^{ - 1} \cdot {P_i}\) (i is odd). Then depending on whether the number of rounds is even or odd, it holds

$$\begin{aligned} {M} = \left( {\begin{array}{*{20}{c}} {{P_1}} &{} 0 \\ 0 &{} I \\ \end{array}} \right) \cdot {M'} \cdot \left( {\begin{array}{*{20}{c}} I &{} 0 \\ 0 &{} {P_1^{-1}} \\ \end{array}} \right) \ or\ {M} = \left( {\begin{array}{*{20}{c}} I &{} 0 \\ 0 &{} {{P_1}} \\ \end{array}} \right) \cdot {M'} \cdot \left( {\begin{array}{*{20}{c}} I &{} 0 \\ 0 &{} {P_1^{-1}} \\ \end{array}} \right) , \end{aligned}$$

respectively, which means M and \(M'\) are permutation homomorphic to each other. Thus their branch numbers are equal according to Proposition 1 and we complete the proof.    \(\square \)

Fig. 4.
figure 4figure 4

Equivalence partitioning of the proposed diffusion layers

Note that all matrices constructed by r-round Feistel structure can be classified according to any \(P_i\), \(i \in \{ 1, \ldots r\}\), although we simply choose \(P_1 = I\) in Theorem 2. In other words, each diffusion layer constructed with \(P_1 = I\) is a representative of an equivalence class, from which one can obtain all diffusion layers in the same equivalence class through analogous transforms mentioned above. We will make use of this property to reduce the search space by one round in the subsequent experiment.

4 Constructing Optimal Diffusion Layers with Feistel Structure

In this section, we will provide the results on constructing diffusion layers for various parameters. According to the search strategy, cyclic shift is preferred choice and for convenience, we abuse the symbol \(R_i\), \(i = 0, \ldots ,n/2-1\), to represent the matrix which corresponds to the transformation \({L}(x) = x < < < i\), where the size of x is n / 2. As an example, an \(8 \times 8\) matrix \(M = [{R_0},{R_3},{R_2},{R_1}]\) denotes the diffusion layer constructed by the following round functions:

$$\begin{aligned} {P_1} = \left( {\begin{array}{*{20}{c}} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \\ \end{array}} \right) ,\ {P_2} = \left( {\begin{array}{*{20}{c}} 0 &{} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 \\ \end{array}} \right) ,\ {P_3} = \left( {\begin{array}{*{20}{c}} 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ \end{array}} \right) ,\ {P_4} = \left( {\begin{array}{*{20}{c}} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 &{} 0 \\ \end{array}} \right) . \end{aligned}$$

Clearly, \(R_0\) is the representation of identity matrix and more \(R_0\)’s implies more efficient software implementation.

4.1 Diffusion Layers for \(n = 4\), 8, 16 and 32

First, we ran Algorithm 1 to search for optimal diffusion layers with commonly used dimensions in cryptography, that is, \(n = 2^k\), where \(k=2,3,4,5\). The total number of optimal solutions and typical instance of M we obtained are summarized in Table 1, accompanied by the cost in hardware implementation for each parameter. Notice that the branch number of \(32 \times 32\) binary matrices we can find is 12, which is consistent with the best known value.

Table 1. Experimental results for optimal diffusion layers with \(n = 4\), 8, 16, and 32

The time complexity of the exhaustive search for optimal solutions with given length n is \({(n/2)^r}\), where r is the least possible number of rounds to achieve the maximal \(\mathcal {B}_d\). Actually it took us less than 20 h to find all the best \(32 \times 32\) binary matrices through parallel search, and with the help of Theorem 2, we can further reduce the search time by a factor of \(2^4\). Additionally, the examples in Table 1 are perfect in the sense that they are constructed with the most possible number of \(R_0\)’s, cutting down on as many rotation instructions as possible in software implementation. Yet it is noteworthy that the technique of equivalence partitioning is not applicable in this case, as matrices in the same equivalence class are constructed by different sequences of round functions.

Moreover, we need to point out for any \(n \times n\) diffusion layer constructed by r-round Feistel structure, the number of XOR gates required for hardware implementation is nr / 2, which enjoys an overwhelming advantage even if we have not done any other optimization. For instance, although the total number of ones in any \(32 \times 32\) binary matrix with \(\mathcal {B}_d = 12\) is lower bounded by \(11 \times 32 = 352\), each of our optimal solutions can be implemented just with 128 XOR’s, which is the best result up to our knowledge.

4.2 Results for Other Parameters

To illustrate the generality of our proposal, we also search for diffusion layers with other sizes (n is even and \(n < 20\)), despite the fact that they are probably not often used. As shown in Table 2, we obtain optimal solutions for each given length, with the exception of \(n = 12\). In other words, no \(12 \times 12\) optimal binary matrix can be constructed by 6-round Feistel structure with cyclic shift as round functions. This means we need to consider searching for suboptimal solutions using the methods described in Sect. 3.3.

Laying particular stress on hardware efficiency, we hereby adopt the second strategy (i.e. enlarging \(\mathbb {G}\) for only one round function) and find 120 suboptimal solutions. One of the best results is

$$\begin{aligned} M_{12 \times 12} = [{R_5},{P_1},{R_4},{R_1},{R_1},{R_0}], \end{aligned}$$

where \(P_1\) is a permutation matrix. Due to the lack of space, we will give the concrete form of \(P_1\) and \(M_{12 \times 12}\) in Appendix B.

Table 2. Experimental results for diffusion layers with other interesting sizes

4.3 Other Information from the Proposed Structure

Below we elaborate how to acquire a prior knowledge of the resulting matrix. First, it is clear that any matrix constructed by r-round Feistel structure can be characterized as

$$\begin{aligned} M^{(r)} = \left( {\begin{array}{*{20}{c}} {{A_1}} &{} {{A_2}} \\ {{A_3}} &{} {{A_4}} \\ \end{array}} \right) , \end{aligned}$$
(8)

where each block matrix is an expression that consists of \(P_i\), \(1 \le i \le r\). For example, based on

$$\begin{aligned} M^{(3)} = \left( {\begin{array}{*{20}{c}} {{P_2}{P_1} \oplus I} &{} {\ \ {P_2}} \\ {{P_3}{P_2}{P_1} \oplus {P_3} \oplus {P_1}} &{} {\ \ {P_3}{P_2} \oplus I} \\ \end{array}} \right) , \end{aligned}$$

we get \(A_4 = {P_3}{P_2} \oplus I\) in this case. Let \(T(A_i)\) be the number of terms in \(A_i\) and through exploring the regularity on changes of \(T(A_i)\), we have

Theorem 3

The four block matrices constituting \(M^{(r)}\) as shown in (8) satisfy

$$\begin{aligned} T({A_1}) = F(r - 1), T({A_2}) = F(r - 2), T({A_3}) = F(r), T({A_4}) = F(r - 1). \end{aligned}$$

This observation is straightforward and we omit the proof here. It seems that Theorem 3 places major focus only on the expanded form, nevertheless, we will later see that it contributes to understanding the generic picture of optimal matrices.

As a matter of fact, each \(\frac{n}{2} \times \frac{n}{2}\) block matrix in the resulting matrix \(M_{n \times n}\) is a circulant matrix [27] for the optimal solution. To explain conveniently, we denote a \(t \times t\) circulant matrix with i ones in the first row by \(U_i^{(t)}\). Then each optimal solution in the case of \(n = 4\) satisfiesFootnote 4

$$\begin{aligned} {M_{4 \times 4}} = \left( {\begin{array}{*{20}{c}} {U_2^{(2)}} &{} \ {U_1^{(2)}} \\ {U_1^{(2)}} &{} \ {U_2^{(2)}} \\ \end{array}} \right) , \end{aligned}$$

since the terms in \(A_i\) are always eliminated pairwise during calculating. Taking \(n = 8\) as anther illustration, in the light of \(T({A_1}) = 3, T({A_2}) = 2, T({A_3}) = 5, T({A_4}) = 3\) and the fact that each row in an optimal solution has at least 4 ones, we conclude

$$\begin{aligned} {M_{8 \times 8}} = \left( {\begin{array}{*{20}{c}} {U_3^{(4)}} &{} \ {U_2^{(4)}} \\ {U_3^{(4)}} &{} \ {U_3^{(4)}} \\ \end{array}} \right) ,\ \ or \ \left( {\begin{array}{*{20}{c}} {U_3^{(4)}} &{} \ {U_2^{(4)}} \\ {U_1^{(4)}} &{} \ {U_3^{(4)}} \\ \end{array}} \right) , \end{aligned}$$

which is in accord with the experimental results. Specifically, there are only 8 instances with the latter form (among 32 optimal solutions) and one example will be introduced in Appendix C.

5 Searching for Involutory Optimal Diffusion Layers

In this section, we consider constructing involutory optimal matrix by Feistel structure with cyclic shift as round function, which enjoys an attractive advantage as it requires only one procedure to be implemented for the encryption and decryption. Notice that the search strategy is derived from Algorithm 1 and most instances come from the optimal solutions. For sizes that are often used in block ciphers, we list some examples in Table 3. Below are some explications of our experimental results:

  1. (1)

    For parameter \(n = 4\), the two optimal solutions we find themselves are involutory, however, the proportion (\(= 24/9760\)) is very low for \(n = 16\).

  2. (2)

    In the case of \(n = 32\), the largest branch number of the involutory matrices constructed by 8-round Feistel structure is 11. We do not search further (for involutory matrices with \(\mathcal {B}_d = 12\)) since the results are lightweight enough to have promising applications.

  3. (3)

    As for \(n = 8\), we do not obtain even one involutory instance despite of many attempts on the number of rounds. Before jumping to the full explanation of this situation, we need the following lemma.

Table 3. Experimental results for involutory diffusion layers with \(n = 4\), 16, and 32

Lemma 2

Assume M is an \(8 \times 8\) optimal matrix as shown in (8) where each \(A_i\), \(1 \le i \le 4\), is a circulant matrix. Then no \(A_i\) can be “0” or \(U_4^{(4)}\).

Proof

Suppose not, then two cases should be discussed:

  1. (a)

    Without loss of generality, we let \(A_1 = 0\), then \(A_3 = U_4^{(4)}\) since every column in M contains at least 4 ones, which implies M is singular and hence is a contradiction.

  2. (b)

    Similarly let \(A_1 = U_4^{(4)}\). Then it needs \(A_3 = U_3^{(4)}\) or \(A_3 = U_1^{(4)}\) to make M invertible. Note that in these cases, we have \(\mathcal {B}_d \le 4\) as there exists a vector \(\mathbf{{v}} = {(1,1,0, \ldots ,0)^T}\) with \({w_b}(\mathbf{{v}}) = 2\) such that \({w_b}(\mathbf{{v}}) + {w_b}(M \cdot \mathbf{{v}}) = 4\). This contradicts the optimality condition.

Consequently, there is no “0” or \(U_4^{(4)}\) among the four block matrices and we complete the proof.   \(\square \)

In addition, the following statement, deduced from [17], is very useful for our illustration.

Proposition 3

For any \(8 \times 8\) binary matrix with \(\mathcal {B}_d = 5\), if the rows have only two different Hamming weights and each contains half number of rows, then it must belong to one of the following cases:

  1. (1)

    the rows are of Hamming weight 4 and 5.

  2. (2)

    the rows are of Hamming weight 5 and 6.

Theorem 4

Any \(8 \times 8\) optimal diffusion layer constructed by r-round Feistel structure with cyclic shift as round functions can not be involutory.

Proof  Footnote 5. First, the conditions in Proposition 3 are always satisfied for every \(8 \times 8\) optimal matrix constructed by the structure as shown in Fig. 1. Furthermore, the four block matrices are all circulant matrices as explained before and according to Lemma 2 and Proposition 3, it is easy to see that the form of any resulting matrix belongs to one of the following eight cases:Footnote 6

$$\begin{aligned} \left( {\begin{array}{*{20}{c}} {{U_3}} &{} {{U_3}} \\ {{U_3}} &{} {{U_2}} \\ \end{array}} \right) ,\left( {\begin{array}{*{20}{c}} {{U_2}} &{} {{U_3}} \\ {{U_3}} &{} {{U_3}} \\ \end{array}} \right) ,\left( {\begin{array}{*{20}{c}} {{U_1}} &{} {{U_3}} \\ {{U_3}} &{} {{U_2}} \\ \end{array}} \right) ,\left( {\begin{array}{*{20}{c}} {{U_2}} &{} {{U_3}} \\ {{U_3}} &{} {{U_1}} \\ \end{array}} \right) , \end{aligned}$$
(9)
$$\begin{aligned} \left( {\begin{array}{*{20}{c}} {{U_3}} &{} {{U_3}} \\ {{U_2}} &{} {{U_3}} \\ \end{array}} \right) ,\left( {\begin{array}{*{20}{c}} {{U_3}} &{} {{U_2}} \\ {{U_3}} &{} {{U_3}} \\ \end{array}} \right) ,\left( {\begin{array}{*{20}{c}} {{U_3}} &{} {{U_2}} \\ {{U_1}} &{} {{U_3}} \\ \end{array}} \right) ,\left( {\begin{array}{*{20}{c}} {{U_3}} &{} {{U_1}} \\ {{U_2}} &{} {{U_3}} \\ \end{array}} \right) . \end{aligned}$$
(10)

For the sake of clarity, we simply denote the resulting matrix by

$$\begin{aligned} \tilde{M} = \left( {\begin{array}{*{20}{c}} A &{} B \\ C &{} D \\ \end{array}} \right) , \end{aligned}$$

and now suppose \(\tilde{M}\) is involutory. Then bases on \({{\tilde{M}}^2} = I\) and the property that multiplications here are commutative, we have

figure bfigure b

Next, we claim neither A nor D is equal to \(U_2\). Otherwise, one is singular and the other is invertible, since there is only one \(U_2\) among each of the eight matrices. This contradicts (14) and we thereby exclude all cases in (9).

For cases in (10), we suppose \(C = U_2\) without loss of generality. Then B is a nonsingular matrix and we have \(B^{-1} = U_1,\ or\ B^{-1} = U_3\).

Due to \(BC \ne 0\), we have \(A^2 \ne I\) from (11). Furthermore, as \(A = U_3\) and \({A^2} = ({U_4} \oplus {U_1})({U_4} \oplus {U_1}) = U_1^2\), it holds

$$\begin{aligned} {A^2} = \left( {\begin{array}{*{20}{c}} 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ \end{array}} \right) . \end{aligned}$$

Then according to (11), we can easily obtain

$$\begin{aligned} C ={\left\{ \begin{array}{ll} (I \oplus {A^2}){U_1} &{} \ \ B^{-1} = U_1 \\ (I \oplus {A^2})({U_4} \oplus {U_1}) = (I \oplus {A^2}){U_1} &{} \ \ B^{-1} = U_3, \end{array}\right. } \end{aligned}$$

which implies the first and third columns in C are the same (the remaining two columns are also the same). Therefore, there always exists a vector \(\mathbf{{v}} = {(1,0,1,0, \ldots ,0)^T}\) with \({w_b}(\mathbf{{v}}) = 2\) such that \({w_b}(\mathbf{{v}}) + {w_b}(C \cdot \mathbf{{v}}) = 4\). This is a contradiction and all cases in (10) are thus excluded.

In summary, no involutory \(8 \times 8\) optimal binary matrix can be constructed by Feistel structure with cyclic shift as round functions and we complete the proof.    \(\square \)

6 Diffusion Layers Constructed with Generalized Feistel Structure

As explained in Sect. 3.3, we restrict the round function to cyclic shift with the purpose of improving software performance. However, an unfavourable situation we are likely to face is that the length of n / 2 words is longer than the word size of the processor. For example, in the case of 8-bit S-box, the software efficiency of our \(16 \times 16\) optimal diffusion layer on 32-bit processor is weaken since the rotation on a 64-bit word becomes complicated.

To make up for the above shortcomings, an instinctive idea is to construct diffusion layers using r-round Type-II Generalized Feistel Structure (GFS, [29]), which be characterized as

$$\begin{aligned} {M_{gfs}} = \left( {\begin{array}{*{20}{c}} 0 &{} 0 &{} 0 &{} I \\ I &{} 0 &{} 0 &{} 0 \\ 0 &{} I &{} 0 &{} 0 \\ 0 &{} 0 &{} I &{} 0 \\ \end{array}} \right) \left( {\begin{array}{*{20}{c}} {{P_{2r - 1}}} &{} I &{} 0 &{} 0 \\ 0 &{} 0 &{} I &{} 0 \\ 0 &{} 0 &{} {{P_{2r}}} &{} I \\ I &{} 0 &{} 0 &{} 0 \\ \end{array}} \right) \cdots \left( {\begin{array}{*{20}{c}} {{P_1}} &{} I &{} 0 &{} 0 \\ 0 &{} 0 &{} I &{} 0 \\ 0 &{} 0 &{} {{P_2}} &{} I \\ I &{} 0 &{} 0 &{} 0 \\ \end{array}} \right) \end{aligned}$$
(15)

where round functions in round i, \(P_{2i-1}\) and \(P_{2i}\), are cyclic left shifts. Also, we use \([{P_1},{P_2}, \ldots ,{P_{2r}}]\) to represent \(M_{gfs}\) for simplicity.

Owing to the slow diffusion property of Type-II GFS, one may take it for granted that the number of rounds will be increased compared with the Feistel structure on the premise of the same \(\mathcal {B}_d\). However, that is not the case according to our search results listed in Table 4. Actually, the conclusion of Theorem 1 also holds for Type-II GFS and hence the solutions we find are optimal in terms of the number of rounds.

Table 4. Experimental results for diffusion layers constructed by Type-II GFS

Compared to the Feistel structure, the cost of hardware implementation of each round in Type-II GFS remains unchanged, which means we can almost perfectly solve the problem introduced at the beginning of this section. Yet, it is to be noticed that the time complexity of r-round search becomes \({(n/4)^{2r}}\), far greater than \({(n/2)^r}\) when \(n > 8\). For example, while searching for \(32 \times 32\) binary matrix with \(\mathcal {B}_d = 11\) constructed by Type-II GFS, the total number of matrices to be detected is \(2^{48}\). Despite the help of equivalence partitioning technique, the search space (i.e. \(2^{42}\)) is still so huge that we need highly parallel computations to obtain all solutions.

7 Comparison with Known Results

In this section, we compare our Feistel-structure-based proposals with previous known results on hardware implementation. As can be seen in Table 5, solutions we found for \(n = 32\) can save approximately \(33.3\,\%\) gate equivalents compared to the best known result. Furthermore, while considering the constructions with \(\mathcal {B}_d = 11\), this improvement shots up to a staggering \(64.7\,\%\).

Table 5. Comparison of our diffusion layers with the known results

For the size \(n = 16\), a noteworthy comparison comes between the diffusion layer used in ARIA and ours. The area can be reduced by \(36.8\,\%\) provided that the original linear layer is replaced by our optimal instances. Moreover, in the case of \(n = 8\), the hardware cost of the design in [16] is equal to ours. The reason, which we have mentioned in the introduction, is that the examples given in [16] can be implicitly implemented by 4-round Feistel structure (while the last swap is not removed).

Here we omit comparisons on the software performance for two reasons. One is some of the previous constructions place the major focus on maximizing the branch number using algebraic methods, ignoring the estimate of implementation efficiency, and thus there is no need to make comparisons. The other is diffusion layers in cryptographic algorithms are often implemented together with S-boxes, which makes the comparisons complicated. Nevertheless, as explained in Sects. 3.3 and 6, our proposals still have excellent software performance even without any optimization.

8 Conclusion

In this paper, we propose a new class of optimal diffusion layers over \(\mathbb {F}_2\) by use of Feistel structure with bit permutation as round functions. Through investigating bounds on the branch number our structure can achieve, we construct optimal binary diffusion layers for a series of parameters (up to \(32 \times 32\)) with excellent software/hardware performances. As far as we know, the hardware cost of most proposals is the lowest compared to the previous results. Involutory optimal instances for the commonly used dimensions are also presented, with the exception of \(8 \times 8\). Finally, we investigate optimal diffusion layers constructed by Type-II GFS and provide some typical solutions. Since the hardware cost of our results are extremely low, we expect our strategy will be useful for future construction of lightweight ciphers based on (involutory) binary diffusion components.