
1 Introduction

One of the most widely used strategies to protect software implementations of blockciphers against side-channel attacks consists in applying secret sharing at the implementation level. This strategy, also known as (higher-order) masking, notably achieves provable security in the probing security model [ISW03] and in the noisy leakage model [PR13, DDF14]. When designing a higher-order masking scheme for a given blockcipher, the main issue is the secure and efficient computation of the s-box. Most of the proposed solutions (see for instance [RP10, CRV14, CPRR15]) are based on a polynomial representation of the s-box over the finite field \(\mathbb {F}_{2^n}\) (where n is the input bit-length), for which the field multiplications are secured using the ISW scheme due to Ishai et al. [ISW03].

An alternative approach has recently been put forward which consists in applying higher-order masking at the Boolean level by bitslicing the s-boxes within a cipher round [GLSV15, GR16]. In the bitsliced higher-order masking paradigm, the ISW scheme is applied to secure bitwise AND instructions, which are significantly more efficient than their field-multiplication counterparts involved in polynomial schemes. Moreover, such a strategy makes it possible to compute all the s-boxes within a cipher round at the same time, which results in significant efficiency gains. To the best of our knowledge, bitsliced higher-order masking has only been applied to specific blockciphers so far. In [GLSV15], Grosso et al. introduce new blockciphers with LS-designs tailored to efficient masked computation in bitslice. The approach has also been used by Goudarzi and Rivain in [GR16] to get fast implementations of two prominent blockciphers, namely AES and PRESENT, masked at an order up to 10. However, no generic method to apply this approach to any blockcipher has been proposed so far. In contrast, several generic methods have been published for the polynomial setting [CGP+12, CRV14, CPRR15]. Therefore, and given the efficiency benefits of the bitsliced higher-order masking approach, defining such a generic method is an appealing open issue.

Finding a Boolean representation of an s-box that yields an efficient computation in the bitsliced masking world essentially amounts to finding a circuit with low multiplicative complexity. The multiplicative complexity of Boolean functions has been studied in a few previous papers [MS92, BPP00, TP14]. In particular, optimal circuits have been obtained for some small (3-bit/4-bit/5-bit) s-boxes using SAT solvers [CMH13, Sto16]. However, no general (heuristic) method has been proposed up to now to get an efficient decomposition for any s-box, and in particular for n-bit s-boxes with \(n\ge 6\).

In this paper, we introduce a new heuristic method to decompose an s-box into a circuit with low multiplicative complexity. Our proposed method follows the same approach as the CRV and algebraic decomposition methods used to get efficient representations in the polynomial setting [CRV14, CPRR15]. We also introduce the notion of parallel multiplicative complexity to capture the fact that several AND gates might be bundled in a single instruction, enabling further gains in the bitslice setting [GR16]. Finally, we describe ARM implementations of bitsliced higher-order-masked s-box layers using our decomposition method and we compare them to optimized versions of the CRV and algebraic decomposition methods. Our results show a clear superiority of the bitslice approach when the masking order exceeds a certain threshold.

The paper is organized as follows. Section 2 gives some preliminaries about Boolean functions and higher-order masking. We then introduce the notion of (parallel) multiplicative complexity and discuss previous results as well as our contribution in Sect. 3. Section 4 presents our heuristic method in a general setting as well as some s-box-specific improvements. Finally, Sect. 5 describes our implementations and the obtained performances.

2 Preliminaries

2.1 Boolean Functions

Let \(\mathbb {F}_2\) denote the field with 2 elements and let n be a positive integer. A Boolean function f with n variables is a function from \(\mathbb {F}_2^n\) to \(\mathbb {F}_2\). The set of such functions is denoted \(\mathcal {F}_{n}\) in this paper. Any Boolean function \(f \in \mathcal {F}_{n}\) can be seen as a multivariate polynomial over \(\mathbb {F}_2[x_1,x_2,\dots ,x_n]/(x_1^2-x_1,x_2^2-x_2,\dots ,x_n^2-x_n)\):

$$\begin{aligned} f({x}) = \sum _{{u}\in \{0,1\}^n} a_{{u}} \, {x}^{{u}},\end{aligned}$$
(1)

where \({x} = (x_1,x_2,\dots ,x_n)\), \({x}^{{u}} = x_1^{u_1} \cdot x_2^{u_2} \cdot \ldots \cdot x_n^{u_n}\), and \(a_{{u}}\in \mathbb {F}_2\) for every \({u} \in \{0,1\}^n\). The above representation is called the Algebraic Normal Form (ANF).
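As an illustration, the ANF coefficients \(a_u\) can be recovered from the truth table of f by the binary Möbius transform. The following minimal Python sketch (ours; an input \((x_1,\dots ,x_n)\) is encoded as the integer with bit \(i-1\) equal to \(x_i\)) shows one way to do it:

```python
def anf_coefficients(truth_table):
    """Binary Moebius transform: maps the truth table of f (length 2^n,
    entries in {0,1}, index encoding the input bits) to the list of ANF
    coefficients a_u, indexed the same way."""
    a = list(truth_table)
    n = len(a).bit_length() - 1
    for k in range(n):                  # fold over each variable x_{k+1}
        step = 1 << k
        for i in range(len(a)):
            if i & step:
                a[i] ^= a[i ^ step]     # XOR in the value at the smaller input
    return a

# Example: f(x1, x2) = x1 * x2 has a single ANF coefficient, at u = (1, 1):
print(anf_coefficients([0, 0, 0, 1]))   # -> [0, 0, 0, 1]
```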

For any family \(f_1,f_2,\dots ,f_m \in \mathcal {F}_{n}\), the set \(\langle f_1,f_2,\dots ,f_m\rangle = \big \{{\sum }_{i=1}^ma_if_i~|~a_i\in \mathbb {F}_2\big \}\) is called the span of the \(f_i\)’s (or the space spanned by the \(f_i\)’s), which is an \(\mathbb {F}_2\)-vector space. Let \(\mathcal {M}_n\) denote the set of monomial functions, that is \(\mathcal {M}_n = \big \{ x \mapsto x^u \mid u \in \{0,1\}^n \big \}\). Then, the set of Boolean functions with n variables can be defined as the span of monomial functions, that is \(\mathcal {F}_{n}= \langle \mathcal {M}_n\rangle \).

Let n and m be two positive integers, and let S be a function mapping \(\mathbb {F}_2^n\) to \(\mathbb {F}_2^m\). Such a function can be seen as a vector of Boolean functions, i.e. \(S(x) = (f_1(x),f_2(x),\dots ,f_m(x))\), and is hence called a vectorial (Boolean) function, also known as an (\(n\times m\)) s-box in cryptography. The Boolean functions \(f_1,f_2,\dots ,f_m\in \mathcal {F}_{n}\) are then called the coordinate functions of S.

2.2 Higher-Order Masking

Higher-order masking consists in sharing each internal variable x of a cryptographic computation into d random variables \(x_1\), \(x_2\), ..., \(x_d\), called the shares and satisfying \(x_1 + x_2 + \cdots + x_d = x\), for some group operation \(+\), such that any set of \(d-1\) shares is randomly distributed and independent of x. In this paper, the considered masking operation will be the bitwise addition. It has been formally demonstrated that in the noisy leakage model, where the attacker gets noisy information on each share, the complexity of recovering information on x grows exponentially with the number of shares [CJRR99, PR13]. This number d, called the masking order, is hence a sound security parameter for the resistance of a masked implementation.

The main issue when protecting a blockcipher implementation with masking is the secure computation of the nonlinear layer applying the s-boxes to the cipher state. The prevailing approach consists in working on the polynomial representation of the s-box over the field \(\mathbb {F}_{2^n}\), which is secured using the ISW scheme [ISW03] for the field multiplications [RP10, CGP+12]. The most efficient polynomial evaluation method in this paradigm is due to Coron et al. [CRV14]. The polynomial representation can also be decomposed into functions of lower algebraic degree, as recently proposed by Carlet et al. in [CPRR15]. In the quadratic case, these functions can then be efficiently secured using the CPRR scheme [CPRR14].

2.3 Bitsliced Higher-Order Masking

An alternative to polynomial methods is to apply masking at the Boolean level using bitslicing (see for instance [DPV01, GLSV15, BGRV15]). In [GR16], the authors apply this approach to get highly efficient implementations of AES and PRESENT with masking order up to 10. In their implementations, bitslice is applied at the s-box level. Specifically, based on a Boolean circuit for an s-box S, one can perform \(\ell \) parallel evaluations of S in software by replacing each gate of the circuit with the corresponding bitwise instruction, where \(\ell \) is the bit-size of the underlying CPU architecture. As a result, the only nonlinear operations in the parallel s-box processing are bitwise AND instructions between \(\ell \)-bit registers, which can be efficiently secured using the ISW scheme. Such an approach achieves important speedups compared to polynomial methods since (i) ISW-based ANDs are substantially faster than ISW-based field multiplications in practice, and (ii) all the s-boxes within a cipher round are computed in parallel. The authors of [GR16] propose an additional optimization. In their context, the target architecture (ARM) is of size \(\ell = 32\) bits, whereas the number of s-boxes per round is 16 (yielding 16-bit bitslice registers). Therefore, they suggest grouping the ANDs by pairs in order to perform a single ISW-based 32-bit AND where the standard method would have performed two ISW-based 16-bit ANDs. This roughly decreases the complexity by a factor of up to two.
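For concreteness, here is a minimal Python sketch (ours) of the ISW scheme transposed to a bitwise AND between two d-sharings of machine words, with shares held in Python integers and `rand_word` standing in for the device RNG; the implementations discussed later operate on 32-bit ARM registers instead:

```python
import secrets

W = 32                                        # word size (bits)
MASK = (1 << W) - 1

def rand_word():
    return secrets.randbits(W)                # stand-in for the device RNG

def isw_and(a, b):
    """ISW multiplication transposed to the bitwise AND of two d-sharings:
    the XOR of the output shares equals (XOR of a) AND (XOR of b).
    Cost: d^2 bitwise ANDs and d(d-1)/2 random words."""
    d = len(a)
    c = [a[i] & b[i] for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = rand_word()
            c[i] ^= r
            # the parenthesization below matters for probing security
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c
```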

3 Multiplicative Complexity of Boolean Functions

We shall call Boolean circuit any computation graph composed of \(\mathbb {F}_2\)-multiplication nodes (AND gates), \(\mathbb {F}_2\)-addition nodes (XOR gates), and switching nodes (NOT gates). Informally speaking, the multiplicative complexity of a Boolean function is the minimum number of \(\mathbb {F}_2\)-multiplication gates required by a Boolean circuit to compute it. This notion can be formalized as follows:

Definition 1

The multiplicative complexity \(C(f_1, f_2, \ldots , f_m)\) of a family of Boolean functions \(f_1\), \(f_2\), ..., \(f_m \in \mathcal {F}_{n}\), is the minimal integer t for which there exist Boolean functions \(g_i,h_i \in \mathcal {F}_{n}\) for \(i \in [\![1,t]\!]\) such that:

$$\begin{aligned} \left\{ \begin{array}{l} g_1, h_1 \in \langle 1, x_1, x_2, \ldots , x_n\rangle \\ \forall i \in [\![2,t]\!]: g_i, h_i \in \langle 1, x_1, x_2, \ldots , x_n, g_1 \cdot h_1, \ldots , g_{i-1} \cdot h_{i-1}\rangle \end{array}\right. \end{aligned}$$
(2)

and

$$\begin{aligned} f_1, f_2, \ldots , f_m \in \langle 1, x_1, x_2, \ldots , x_n, g_1 \cdot h_1, \ldots , g_t \cdot h_t\rangle . \end{aligned}$$
(3)

It is easy to see that any set of Boolean functions \(\{f_1, f_2, \ldots ,f_m\} \subseteq \mathcal {F}_{n}\) has multiplicative complexity satisfying

$$\begin{aligned} C(f_1, f_2, \ldots , f_m) \le C(\mathcal {M}_n) = 2^n - (n+1). \end{aligned}$$
(4)

Moreover, a counting argument (comparing the number of circuits with a given number of multiplication gates to the number of Boolean functions) shows that there exists \(f \in \mathcal {F}_{n}\) such that

$$\begin{aligned} C(f) > 2^\frac{n}{2} - n. \end{aligned}$$
(5)

In [BPP00], Boyar et al. provide a constructive upper bound for any Boolean function:

Theorem 1

([BPP00]). For every \(f \in \mathcal {F}_{n}\), we have

$$\begin{aligned} C(f) \le {\left\{ \begin{array}{ll}2^{\frac{n}{2}+1} - \frac{n}{2} - 2 &{} {if~n~is~even,}\\ 3 \cdot 2^{\frac{n-1}{2}} - \frac{n-1}{2} - 2 &{} {otherwise.}\\ \end{array}\right. } \end{aligned}$$
(6)

The particular case of Boolean functions with 4 and 5 variables has been investigated by Turan and Peralta in [TP14]. They give a complete characterization of affine-equivalence classes of these functions and they show that every \(f \in \mathcal {F}_4\) has \(C(f)\le 3\) and every \(f \in \mathcal {F}_5\) has \(C(f) \le 4\).

Other works have focused on the multiplicative complexity of particular kinds of Boolean functions. In [MS92], Mirwald and Schnorr investigate in depth the case of functions with quadratic ANF. In particular, they show that such functions have multiplicative complexity at most \(\lfloor \frac{n}{2}\rfloor \). Boyar et al. give further upper bounds for symmetric Boolean functions in [BPP00].

3.1 Multiplicative Complexity of S-Boxes

The multiplicative complexity of an s-box \(S: x \mapsto (f_1(x),f_2(x),\dots ,f_m(x))\) is naturally defined as the multiplicative complexity of the family of its coordinate functions. We shall also call multiplicative complexity of a given circuit the actual number of multiplication gates involved in the circuit, so that the multiplicative complexity of a circuit gives an upper bound on the multiplicative complexity of the underlying s-box.

The best known circuit for the AES s-box in terms of multiplicative complexity is due to Boyar et al. [BMP13]. This circuit achieves a multiplicative complexity of 32 which was obtained by applying logic minimization techniques to the compact representation of the AES s-box due to Canright [Can05] (and saving 2 multiplications compared to the original circuit).

In [CMH13], Courtois et al. use SAT-solving to find the multiplicative complexity of small s-boxes. Their approach consists in writing the Boolean system obtained for a given s-box and a given (target) multiplicative complexity t as a SAT-CNF problem, where the unknowns of the system are the coefficients of the \(g_i\) and \(h_i\) in Definition 1. For each value of t, the solver either returns a solution or a proof that no solution exists, so that the multiplicative complexity is the first value of t for which a solution is returned. They apply this approach to find Boolean circuits with the smallest multiplicative complexity for a random \(3 \times 3\) s-box (meant to be used in CTC2 [Cou07]), the \(4 \times 4\) s-box of PRESENT, and for several sets of \(4 \times 4\) s-boxes proposed for GOST [PLW10]. These results have recently been extended by Stoffelen who applied the Courtois et al. approach to find optimal circuits for various \(4 \times 4\) and \(5 \times 5\) s-boxes [Sto16].

The main limitation of the SAT-solving approach is that it is only applicable to small s-boxes due to the combinatorial explosion of the underlying SAT problem; getting the decomposition of an s-box of size e.g. \(n=8\) seems out of reach. Moreover, the method is not generic in the sense that the obtained decomposition holds for a single s-box and does not provide an upper bound for the multiplicative complexity of s-boxes of a given size.

3.2 Our Results

We give new constructive upper bounds for the multiplicative complexity of s-boxes. As a first result, we extend Theorem 1 to s-boxes (see proof in the full version):

Theorem 2

For every \(S \in \mathcal {F}_{n}^m\), we have:

$$\begin{aligned} C(S) \le \min _{k\in [\![1,n]\!]} (m 2^k + 2^{n - k} +k) - (m+n+1). \end{aligned}$$
(7)

When \(m=n\), the min is achieved by \(k = \big \lfloor \frac{n - \log _2 n}{2} \big \rceil \) for most \(n \in \mathbb {N}\), which gives \(C(S) \le B_n\) with

$$\begin{aligned} B_n \approx \sqrt{n}\, 2^{\frac{n}{2}+1} - \Big (\frac{3n + \log _2 n}{2} +1\Big ). \end{aligned}$$
(8)
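The bound of Theorem 2 is straightforward to evaluate numerically; the following short script (ours) computes the right-hand side of (7) for \(n\times n\) s-boxes:

```python
def theorem2_bound(n, m):
    """Right-hand side of (7): min over k in [1, n] of
    m*2^k + 2^(n-k) + k, minus (m + n + 1)."""
    return min(m * 2**k + 2**(n - k) + k for k in range(1, n + 1)) - (m + n + 1)

for n in range(4, 11):                  # n x n s-boxes
    print(n, theorem2_bound(n, n))
```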

We further introduce in this paper a heuristic decomposition method achieving lower multiplicative complexity. Our general result is summarized in the following theorem:

Theorem 3

For every \(S \in \mathcal {F}_{n}^m\), we have \(C(S) \le C_{n,m}\) with

$$\begin{aligned} C_{n,m} \approx \sqrt{m} \, 2^{\frac{n}{2} +1} - m - n -1. \end{aligned}$$
(9)

And in particular

$$\begin{aligned} C_{n,n} = \left\{ \begin{array}{ll} 17 ~~&{} {for~n=5} \\ 31 &{} {for~n=6} \\ 50 &{} {for~n=7} \\ \end{array}\right. ~~~{and}~~~ C_{n,n} = \left\{ \begin{array}{ll} 77 ~~&{} {for~n=8} \\ 122 &{} {for~n=9} \\ 190 &{} {for~n=10} \\ \end{array}\right. \end{aligned}$$
(10)
Table 1. Multiplicative complexities of \(n\times n\) s-boxes.

In the above theorem, \(C_{n,m}\) denotes the multiplicative complexity of the generic method presented in Sect. 4. We also propose non-generic improvements of this method that might give different results depending on the s-box. Table 1 summarizes the multiplicative complexities obtained by the above two theorems and the non-generic improved method for \(n\times n\) s-boxes with \(n \in [\![4,10]\!]\). For the latter, the figures represent what we expect to achieve for a random s-box (and that we were able to achieve for some tested s-boxes).

3.3 Parallel Multiplicative Complexity

We now introduce the notion of parallel multiplicative complexity for Boolean functions and s-boxes. We consider circuits with multiplication gates that can process up to k multiplications in parallel. The k-parallel multiplicative complexity of an s-box is the least number of k-parallel multiplication gates required by a circuit to compute it. This notion is formalized hereafter:

Definition 2

The k-parallel multiplicative complexity \(C^{(k)}(f_1, f_2, \ldots , f_m)\) of a family of Boolean functions \(f_1, f_2, \ldots ,f_m \in \mathcal {F}_{n}\), is the minimal integer t for which there exist Boolean functions \(g_i,h_i \in \mathcal {F}_{n}\) for \(i \in [\![1,tk]\!]\) such that:

$$\begin{aligned} \left\{ \begin{array}{l} g_1, h_1, g_2, h_2, \ldots , g_k,h_k \in \langle 1, x_1, x_2, \ldots , x_n\rangle ~,\\ \forall i \in [\![1,t-1]\!]:g_{ik+1}, h_{ik+1}, \ldots , g_{(i+1)k}, h_{(i+1)k} \\ ~~~~~~~~~~~~~~~~~~~~~~~~~~ \in \langle 1, x_1, x_2, \ldots , x_n, g_1 \cdot h_1, \ldots , g_{ik} \cdot h_{ik}\rangle \end{array}\right. \end{aligned}$$
(11)

and

$$\begin{aligned} f_1, f_2, \ldots , f_m \in \langle 1, x_1, x_2, \ldots , x_n, g_1 \cdot h_1, \ldots , g_{tk} \cdot h_{tk}\rangle . \end{aligned}$$
(12)

The main motivation for introducing this notion comes from the following scenario. Assume we want to perform m s-box computations in bitslice on an \(\ell \)-bit architecture, where \(\ell > m\). Then we have to pick a circuit computing the s-box and translate it in software by replacing Boolean gates with corresponding bitwise instructions. Since each bitsliced register contains m bits (m versions of the same input or intermediate bit), one can perform up to \(k = \lfloor \ell / m \rfloor \) multiplication gates with a single \(\ell \)-bit AND instruction (modulo some packing of the operands and unpacking of the results). If the used circuit has a k-parallel multiplicative complexity of t, then the resulting bitsliced implementation involves t bitwise AND instructions. This number of AND instructions is the main efficiency criterion when such an implementation is protected with higher-order masking, which makes the k-parallel multiplicative complexity an important parameter for an s-box in this context.
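As a toy illustration of this packing for \(k=2\), \(m=16\) and \(\ell =32\) (a Python sketch, ours; in a masked implementation the single AND below becomes an ISW-based AND over the shares):

```python
def pack2(lo16, hi16):
    """Pack two 16-bit bitslice registers into one 32-bit word."""
    return ((hi16 & 0xFFFF) << 16) | (lo16 & 0xFFFF)

def unpack2(w):
    return w & 0xFFFF, (w >> 16) & 0xFFFF

# Two AND gates on 16-bit registers cost a single 32-bit AND:
g1, h1 = 0b1010110000110101, 0b0110011001100110
g2, h2 = 0b1111000011110000, 0b1010101010101010
p1, p2 = unpack2(pack2(g1, g2) & pack2(h1, h2))
assert p1 == g1 & h1 and p2 == g2 & h2
```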

The authors of [GR16] show that the AES circuit of Boyar et al. can be fully parallelized at degree 2, i.e. its 2-parallel multiplicative complexity is 16. In the full version of this paper, we further show that a sorted version of this circuit can achieve k-parallel multiplicative complexity of 9, 7 and 6 for \(k=4\), \(k=8\), and \(k=16\) respectively.

The decomposition method introduced in this paper has the advantage of being highly parallelizable. Table 2 summarizes the obtained k-parallel multiplicative complexity \(C^{(k)}_{n,n}\) for \(n\times n\) s-boxes for \(k\in \{2,4\}\). Note that we always have \(C^{(k)}_{n,n} \in \{\lceil \frac{C_{n,n}}{k} \rceil , \lceil \frac{C_{n,n}}{k} \rceil +1 \}\), which is almost optimal.

Table 2. Parallel multiplicative complexities of our method for \(n\times n\) s-boxes.

4 A Heuristic Decomposition for S-Boxes

In this section, we introduce a heuristic decomposition method for s-boxes that aims at minimizing the number of \(\mathbb {F}_2\) multiplications. The proposed method follows the same approach as the CRV decomposition over \(\mathbb {F}_{2^n}[x]\). We first describe the proposed heuristic for a single Boolean function before addressing the case of s-boxes.

4.1 Decomposition of a Single Boolean Function

Let f be a Boolean function. The proposed decomposition simply consists in writing f as:

$$\begin{aligned} f(x) = \sum _{i=0}^{t-1}g_i(x)\cdot h_i(x) + h_t(x) \end{aligned}$$
(13)

where \(g_i,h_i\in \langle \mathcal {B}\rangle \), for some basis of functions \(\mathcal {B}=\{\phi _j\}_{j=1}^{|\mathcal {B}|}\). Assume that all the \(\phi _j(x)\), \(\phi _j\in \mathcal {B}\), can be computed with r multiplications. Then the total multiplicative complexity of the above decomposition is \(r+t\). We now explain how to find such a decomposition by solving a linear system.

Solving a Linear System. As in the CRV method, we first sample t random functions \(g_i\) from \(\langle \mathcal {B}\rangle \). This is simply done by picking \(t\cdot |\mathcal {B}|\) random bits \(a_{i,j}\) and setting \(g_i = \sum _{\phi _j\in \mathcal {B}}a_{i,j}\phi _j\). Then we search for a family of \(t+1\) Boolean functions \(\{h_i\}_i\) satisfying (13). This is done by solving the following system of linear equations over \(\mathbb {F}_2\):

$$\begin{aligned} A\cdot c = b \end{aligned}$$
(14)

where \(b = (f(e_1),f(e_2),\dots ,f(e_{2^n}))^\mathsf {T}\) with \(\{e_1, e_2, \dots , e_{2^n}\} = \mathbb {F}_2^n\) and where A is a matrix defined as the concatenation of \(t+1\) submatrices:

$$\begin{aligned} A = (A_{0}|A_1|~\cdots ~|~A_{t}) \end{aligned}$$
(15)

with

$$\begin{aligned} A_{i} = \begin{pmatrix}\phi _1(e_1)\cdot g_{i}(e_1)&{}\phi _2(e_1)\cdot g_{i}(e_1)&{}...&{}\phi _{|\mathcal {B}|}(e_1)\cdot g_{i}(e_1)\\ \phi _1(e_2)\cdot g_{i}(e_2)&{}\phi _2(e_2)\cdot g_{i}(e_2)&{}...&{}\phi _{|\mathcal {B}|}(e_2)\cdot g_{i}(e_2)\\ \\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \\ \phi _1(e_{2^{n}})\cdot g_{i}(e_{2^{n}})&{}\phi _2(e_{2^{n}})\cdot g_{i}(e_{2^{n}})&{}...&{}\phi _{|\mathcal {B}|}(e_{2^{n}})\cdot g_{i}(e_{2^{n}})\\ \end{pmatrix} \end{aligned}$$
(16)

for \(0 \le i\le t-1\), and

$$\begin{aligned} A_t = \begin{pmatrix}\phi _1(e_1)&{}\phi _2(e_1)&{}...&{}\phi _{|\mathcal {B}|}(e_1)\\ \phi _1(e_2)&{}\phi _2(e_2)&{}...&{}\phi _{|\mathcal {B}|}(e_2)\\ \\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \\ \phi _1(e_{2^{n}})&{}\phi _2(e_{2^{n}})&{}...&{}\phi _{|\mathcal {B}|}(e_{2^{n}})\\ \end{pmatrix} \end{aligned}$$
(17)

It can be checked that the vector c, solution of the system, gives the coefficients of the \(h_i\)’s over the basis \(\mathcal {B}\). A necessary condition for this system to have a solution whatever the target vector b (i.e. whatever the Boolean function f) is to get a matrix A of full rank. In particular, the following inequality must hold:

$$\begin{aligned} (t+1)|\mathcal {B}|\ge 2^n. \end{aligned}$$
(18)

Another necessary condition to get a full-rank matrix is that the squared basis \(\mathcal {B}\times \mathcal {B} = \{\phi _i\cdot \phi _k ~|~ \phi _i,\phi _k\in \mathcal {B}\}\) spans the entire space \(\mathcal {F}_{n}\). A classic basis of the vector space is the set of monomials \(\mathcal {M}_n\). Therefore, we suggest to take a basis \(\mathcal {B}\) such that \(\mathcal {M}_n \subseteq \mathcal {B}\times \mathcal {B}\). Let

$$\begin{aligned}&\mathcal {B}_0 = \{x\mapsto x^{{u}}, {u}\in \mathcal {U} \} \nonumber \\&\qquad \qquad ~ \text {with} ~~ \mathcal {U} = \big \{(u_1, \ldots , u_{\ell }, 0,\dots ,0) \big \} \cup \big \{(0,\dots ,0, u_{\ell +1}, \ldots , u_{n}) \big \} \end{aligned}$$
(19)

where \(\ell = \lceil \frac{n}{2} \rceil \) and where \(u_i \in \{0,1\}\) for every \(i\in [\![1,n]\!]\). Then, we clearly have \(\mathcal {B}_0\times \mathcal {B}_0 = \mathcal {M}_n\). We hence suggest taking \(\mathcal {B} \supseteq \mathcal {B}_0\), with \(\mathcal {B}\) possibly larger than \(\mathcal {B}_0\) since restricting ourselves to \(\mathcal {B} = \mathcal {B}_0\) could be non-optimal in terms of multiplications for the underlying decomposition method. Indeed, (18) shows that the more elements in the basis, the smaller t, i.e. the fewer multiplications \(g_i\cdot h_i\). We might therefore derive a bigger basis by iterating \(\mathcal {B} \leftarrow \mathcal {B}\cup \{\phi _j\cdot \phi _k\}\), where \(\phi _j\) and \(\phi _k\) are randomly sampled from \(\mathcal {B}\), until reaching a basis \(\mathcal {B}\) with the desired cardinality.
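To make the construction concrete, the following Python sketch (ours) builds the matrix A of (15)–(17) for a toy instance with \(n=4\), \(t=3\) and \(\mathcal {B} = \mathcal {B}_0\), and solves (14) by Gaussian elimination over \(\mathbb {F}_2\):

```python
import random

def solve_gf2(A_rows, b):
    """Solve A.c = b over GF(2). A_rows[i] is an int whose bit j is A[i][j];
    b[i] is in {0,1}. Returns one solution as an int bitmask, or None."""
    rows = list(zip(A_rows, b))
    pivots, ncols = [], max(A_rows).bit_length()
    for col in range(ncols):
        idx = next((i for i, (r, _) in enumerate(rows) if (r >> col) & 1), None)
        if idx is None:
            continue
        pr, pv = rows.pop(idx)
        pivots.append((col, pr, pv))
        rows = [((r ^ pr, v ^ pv) if (r >> col) & 1 else (r, v)) for r, v in rows]
    if any(v for _, v in rows):               # leftover row 0 = 1: inconsistent
        return None
    c = 0
    for col, pr, pv in reversed(pivots):      # back-substitution
        if pv ^ (bin(pr & c).count("1") & 1):
            c |= 1 << col
    return c

# Toy instance: n = 4, B = B_0, t = 3 (one more than the optimum t = 2,
# which eases getting a full-rank system). Points of F_2^4 are ints 0..15.
n, t = 4, 3
points = range(1 << n)
U = sorted({x & 0b0011 for x in points} | {x & 0b1100 for x in points})
phis = [(lambda u: lambda x: int(x & u == u))(u) for u in U]  # monomials x^u

def span_eval(coeffs, x):                     # evaluate sum_j coeffs[j]*phi_j
    return sum(cj * p(x) for cj, p in zip(coeffs, phis)) & 1

gs = [[random.randint(0, 1) for _ in phis] for _ in range(t)]  # random g_i

def A_row(e):                                 # blocks A_0 | ... | A_{t-1} | A_t
    bits = [p(e) & span_eval(g, e) for g in gs for p in phis]
    bits += [p(e) for p in phis]
    return sum(bit << j for j, bit in enumerate(bits))

f = lambda x: (0x1B79 >> x) & 1               # an arbitrary target function
c = solve_gf2([A_row(e) for e in points], [f(e) for e in points])
# The bits of c are the coefficients of h_0, ..., h_t over B; if c is None,
# resample the g_i's and try again.
```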

We then have \(r = |\mathcal {B}| - n - 1\), where we recall that r denotes the number of multiplications required to derive \(\mathcal {B}\): the functions \(x\mapsto 1,x\mapsto x_1,\dots ,x\mapsto x_n\in \mathcal {B}\) require no multiplications, and each further element of \(\mathcal {B}\) costs one. By construction, we have \(|\mathcal {B}|\ge |\mathcal {B}_0| = 2^{\ell }+2^{n-\ell }-1\), implying \(r\ge 2^{\ell }+2^{n-\ell }-(n+2)\). Let \(C_{n} = r+t\) denote the number of multiplications achieved by our decomposition method. Then, injecting \(C_{n}\) in (18) gives:

$$\begin{aligned} (C_{n}-r+1)(n+1+r) \ge 2^n ~, \end{aligned}$$
(20)

that is:

$$\begin{aligned} C_{n} \ge r+\frac{2^n}{n+1+r}-1. \end{aligned}$$
(21)

It can be checked that the value of r minimizing the above bound is \(2^{\frac{n}{2}}-(n+1)\). However, r must satisfy \(r\ge 2^{\ell }+2^{n-\ell }-(n+2)\) where \(\ell = \lceil \frac{n}{2} \rceil \), which is always greater than \(2^{\frac{n}{2}}-(n+1)\) for \(n \ge 2\). That is why we shall define the optimal value of the parameter r (for the single-Boolean-function case) as:

$$\begin{aligned} r_{opt} = 2^{\ell }+2^{n-\ell }-(n+2) = \left\{ \begin{array}{ll} 2^{\frac{n}{2} +1} - (n+2) &{} \text {if}~n~\text {even,}\\ 3 \cdot 2^{\frac{n-1}{2}} - (n+2) ~&{} \text {if}~n~\text {odd,}\\ \end{array}\right. \end{aligned}$$
(22)

which amounts to taking \(\mathcal {B} = \mathcal {B}_0\). The corresponding optimal value for t is then defined as:

$$\begin{aligned} t_{opt} = \Big \lceil \frac{2^n}{r_{opt}+n+1}\Big \rceil -1 \end{aligned}$$
(23)

which gives \(t_{opt} \approx 2^{\frac{n}{2}-1}\) for n even, and \(t_{opt} \approx \frac{1}{3} \, 2^{\frac{n+1}{2}}\) for n odd.
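These optimal parameters are easy to compute; the short script below (ours) evaluates (22), (23) and the resulting \(C_n = r_{opt} + t_{opt}\):

```python
import math

def single_function_params(n):
    """r_opt and t_opt from (22) and (23), with B = B_0."""
    l = math.ceil(n / 2)
    r = 2**l + 2**(n - l) - (n + 2)
    t = math.ceil(2**n / (r + n + 1)) - 1
    return r, t, r + t                        # (r_opt, t_opt, C_n)

for n in range(4, 11):
    print(n, single_function_params(n))
```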

Table 3. Optimal and achievable parameters for a single Boolean function.

In Table 3, we give the optimal values of (r, t) as well as the corresponding size of the basis \(\mathcal {B}\) and multiplicative complexity \(C_n\) for \(n \in [\![4,10]\!]\). We also give the parameter values that we could actually achieve in practice to get a full-rank system. We observe a small gap between the optimal and the achievable parameters, which results from the heuristic nature of the method (since we cannot prove that the constructed matrix A is full-rank).

4.2 S-Box Decomposition

Let \(S:x\mapsto (f_1(x),f_2(x),\dots ,f_m(x))\) be an s-box. We can apply the above heuristic to each of the m coordinate functions \(f_i\) to get a decomposition as follows:

$$\begin{aligned} f_i(x) = \sum _{j=0}^{t-1}g_{j}(x)\cdot h_{i,j}(x) + h_{i,t}(x),\end{aligned}$$
(24)

for \(1\le i\le m\). Here the \(g_j\)’s are randomly sampled from \(\langle \mathcal {B} \rangle \) until obtaining a full-rank system, which is then used to decompose every coordinate function \(f_i\). The total number of multiplications is \(C_{n,m} = r + m\cdot t\). Then, (18) gives:

$$\begin{aligned} C_{n,m} \ge r+ m\,\Big (\frac{2^n}{n+1+r}-1\Big ). \end{aligned}$$
(25)

It can be checked that the value of r minimizing the above bound is \(\sqrt{m2^n}-n-1\). We hence define

$$\begin{aligned} r_{opt} = \Big \lfloor \sqrt{m2^n} \Big \rceil -n-1, \end{aligned}$$
(26)

which minimizes (25) for every \(n \in [\![2,10]\!]\) and every \(m \in [\![1,n]\!]\). Moreover, this value satisfies the constraint \(\mathcal {B} \supseteq \mathcal {B}_0\), i.e. \(r_{opt}\ge 2^{\ell }+2^{n-\ell }-(n+2)\), for every \(m\ge 4\), and in practice we shall only consider s-boxes with \(m\ge 4\). The corresponding optimal value \(t_{opt}\) is then defined w.r.t. \(r_{opt}\) as in (23), which satisfies

$$\begin{aligned} t_{opt} = \Big \lceil \frac{2^{\frac{n}{2}}}{\sqrt{m}}\Big \rceil -1 \end{aligned}$$
(27)

for every \(n \in [\![2,10]\!]\) and every \(m \in [\![1,n]\!]\). We hence get

$$\begin{aligned} C_{n,m} \ge r_{opt} + m \cdot t_{opt} \approx \sqrt{m} \, 2^{\frac{n}{2}+1} - (n+m+1). \end{aligned}$$
(28)

In Table 4, we give the optimal values for the parameters (rt) as well as the corresponding size of the basis \(\mathcal {B}\) and multiplication complexity \(C_{n,n}\) for \(n\times n\) s-boxes with \(n \in [\![4,10]\!]\). We also give the parameter values that we could actually achieve in practice to get a full-rank system.

Table 4. Optimal and achievable parameters for an \(n\times n\) s-box.

In comparison to the single-Boolean-function case, the optimal size of the basis \(\mathcal {B}\) for the s-box decomposition is significantly bigger. This comes from the fact that a bigger basis implies a lower t for each of the m coordinate functions (i.e. decrementing t decreases \(C_{n,m}\) by m). We also observe a very small gap (sometimes zero) between the optimal and the achievable parameters. This tightness, compared to the single-Boolean-function case, is most likely due to the fact that we use a bigger basis.

4.3 Improvements

We present hereafter some improvements of the above method which can be applied to get a decomposition with better multiplicative complexity for a given s-box. Unlike the above results, the obtained decomposition and the associated multiplicative complexity depend on the target s-box and do not apply to all s-boxes.

Basis Update. Our first improvement of the above method is based on a dynamic update of the basis each time a coordinate function \(f_i(x)\) is computed. Indeed, the term \(g_{j}(x)\cdot h_{i,j}(x)\) involved in the computation of \(f_i(x)\) can be reused in the computation of the following \(f_{i+1}(x)\), ..., \(f_{m}(x)\). In our decomposition process, this means that the \(g_{j}\cdot h_{i,j}\) functions can be added to the basis for the decomposition of the next coordinate functions \(f_{i+1}\), ..., \(f_{m}\). Basically, we start with some basis \(\mathcal {B}_1\supseteq \mathcal {B}_0\), where \(\mathcal {B}_0\) is the minimal basis as defined in (19). Then, for every \(i \ge 1\), we look for a decomposition

$$\begin{aligned} f_i(x) = \sum _{j=0}^{t_i - 1}g_{i,j}(x)\cdot h_{i,j}(x) + h_{i,t_i}(x), \end{aligned}$$
(29)

where \(t_i \in \mathbb {N}\) and \(g_{i,j}, h_{i,j} \in \langle \mathcal {B}_i \rangle \). Once such a decomposition has been found, we carry on with the new basis \(\mathcal {B}_{i+1}\) defined as:

$$\begin{aligned} \mathcal {B}_{i+1} = \mathcal {B}_i \cup \{g_{i,j}\cdot h_{i,j}\}_{j=0}^{t_i-1}.\end{aligned}$$
(30)

Compared to the former approach, we use different functions \(g_{i,j}\) and we get a different matrix A for every coordinate function \(f_i\). On the other hand, for each decomposition, the basis grows and hence the number \(t_i\) of multiplicative terms in the decomposition of \(f_i\) might decrease. In this context, we obtain a new condition for every i, namely:

$$\begin{aligned} t_i\ge \frac{2^n}{|\mathcal {B}_i|}-1.\end{aligned}$$
(31)

The lower bound on \(t_i\) hence decreases as \(\mathcal {B}_i\) grows. The total multiplicative complexity of the method is then:

$$\begin{aligned} C_{n,m}^* = r + \sum _{i=1}^mt_i, \end{aligned}$$
(32)

where \(r = |\mathcal {B}_1| -(n+1)\) is the number of multiplications required to derive the initial basis \(\mathcal {B}_1\). From the above inequality, we can define the optimal sequence of \(t_i\) and \(s_i=|\mathcal {B}_i|\) as:

$$\begin{aligned} t_1 = \psi _n(s_1) ~~\text {and}~~ {\left\{ \begin{array}{ll} s_{i+1} = s_i + t_i \\ t_{i+1} = \psi _n(s_{i+1}) \end{array}\right. } ~~\text {for every}~i \ge 1 \end{aligned}$$
(33)

where \(\psi _n:x\mapsto \Big \lceil \frac{2^n}{x}\Big \rceil -1 \). The sequence \((s_i,t_i)\) is fully determined by the cardinality of the original basis \(s_1 =|\mathcal {B}_1|\), and we have:

$$\begin{aligned} {\left\{ \begin{array}{ll} s_i = (\psi _n+\mathrm {Id})^{(i-1)}(s_1) \\ t_i = \psi _n\circ (\psi _n+\mathrm {Id})^{(i-1)}(s_1) \end{array}\right. } \end{aligned}$$
(34)

for every \(i\ge 1\), where \(\mathrm {Id}\) denotes the identity function. The obtained optimal complexity is then:

$$\begin{aligned} C_{n,m}^*= & {} s_1 - (n+1) + \sum _{i=1}^m \psi _n \circ (\psi _n+\mathrm {Id})^{(i-1)}(s_1) \\= & {} (\psi _n+\mathrm {Id})^{(m)}(s_1) - (n+1) ~. \end{aligned}$$

By definition of \(\psi _n\), the obtained functions \((\psi _n+\mathrm {Id})^{(i)}\) are sums of continued fractions with ceilings, for which we have no analytic expression.
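The sequence (33) and the resulting complexity are nevertheless easy to evaluate numerically, as in the following sketch (ours):

```python
import math

def psi(n, x):
    return math.ceil(2**n / x) - 1

def basis_update_cost(n, m, s1):
    """C*_{n,m} for a starting basis of size s1, following (32)-(33):
    r = s1 - (n+1) multiplications to build B_1, then t_i = psi_n(s_i)
    with s_{i+1} = s_i + t_i."""
    s, ts = s1, []
    for _ in range(m):
        ts.append(psi(n, s))
        s += ts[-1]
    return s1 - (n + 1) + sum(ts), ts

# Minimal starting basis B_1 = B_0: s1 = 2^ceil(n/2) + 2^floor(n/2) - 1.
n = 8
s1 = 2**math.ceil(n / 2) + 2**(n // 2) - 1
print(basis_update_cost(n, n, s1))   # -> (61, [8, 6, 5, 5, 4, 4, 4, 3])
```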

Table 5. Optimal parameters with basis-update improvement.

In Table 5, we give the optimal parameters \(s_1=|\mathcal {B}_1|\) and the corresponding \(r = s_1 - (n+1)\), \(t_1\), \(t_2\), ..., \(t_n\), and \(C_{n,n}^*\) for \(n \times n\) s-boxes with \(n \in [\![4,10]\!]\). When the optimal multiplicative complexity is obtained for several values of \(s_1\), we give all the obtained sets of parameters. We observe that the optimal multiplicative complexity is always achieved by starting with the minimal basis, i.e. by taking \(\mathcal {B}_1 = \mathcal {B}_0\). It can also be obtained by taking \(s_1\) up to \(|\mathcal {B}_0| +3\) depending on the value of n.

The achievable counterpart of Table 5 only exists with respect to a given s-box since the functions \(g_{i,j} \cdot h_{i,j}\) added to the basis at each step depend on the actual s-box. For a given s-box, however, the method can be further improved, as we show hereafter.

Rank Drop. Our second improvement is based on the observation that even if the matrix A is not full-rank, the obtained system can still have a solution for some given s-box. Specifically, if A is of rank \(2^n-\delta \), then we should get a solution for one s-box out of \(2^\delta \) on average. Hence, instead of having \(t_i\) satisfy the condition \((t_i+1)|\mathcal {B}|\ge 2^n\), we allow a rank drop in the system of equations by taking \(t_i\ge \frac{2^{n}-\delta }{|\mathcal {B}_i|} -1\) for some integer \(\delta \) for which solving \(2^\delta \) systems is affordable. We hence hope to get smaller values of \(t_i\) by trying \(2^{\delta }\) systems. Note that heuristically, we can only hope to achieve the above bound if \(\delta \) is a few times smaller than the maximal rank \(2^n\) (e.g. \(\delta \le \frac{2^n}{4}\)). We can then define the (theoretical) optimal sequence \((s_i,t_i)\) and the corresponding multiplicative complexity \(C_{n,m}^*\) from \(s_1 = |\mathcal {B}_1|\) as in (33) and in the above expression of \(C_{n,m}^*\), by replacing the function \(\psi _n\) with \(\psi _{n,\delta }:x\mapsto \lceil \frac{2^n-\delta }{x} \rceil -1\). As an illustration, Table 6 provides the obtained parameters for \(\delta \) up to 32. We see that the rank-drop improvement (theoretically) saves a few multiplications.

Table 6. Optimal parameters with basis-update and rank-drop improvements.

In practice, we observe that the condition \((t_i+1)|\mathcal {B}_i|\ge 2^n-\delta \) is not always sufficient to get a matrix A of rank \(2^n-\delta \). We shall then start with \(t_i=\psi _{n,\delta }(|\mathcal {B}_i|)\) and try to solve \(\alpha \cdot 2^{\delta }\) systems, for some constant \(\alpha \). In case of failure, we increment \(t_i\) and start again until a solvable system is found. The overall process is summarized in Algorithm 1.

Algorithm 1. Improved s-box decomposition with basis update and rank drop.
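Since the pseudocode is not reproduced here, the following Python sketch (ours, with hypothetical helpers) gives our reading of Algorithm 1's control flow; `try_system(f, t, basis)` is assumed to sample random \(g_{i,j}\)'s, build the corresponding system with rank drop \(\delta \) as in Sect. 4.1, and return the \((g_{i,j}, h_{i,j})\) truth tables on success or None on failure:

```python
import math

def algorithm1_sketch(coords, n, basis, delta, alpha, try_system):
    """Control flow of Algorithm 1 as we read it from the text (helper names
    are ours). coords: truth tables of f_1..f_m packed into ints (bit e is
    f(e)); basis: truth tables of B_1; try_system as described above."""
    mults = len(basis) - (n + 1)              # r: multiplications to derive B_1
    for f in coords:
        t = math.ceil((2**n - delta) / len(basis)) - 1   # psi_{n,delta}(|B_i|)
        while True:
            sol = next((s for s in (try_system(f, t, basis)
                                    for _ in range(alpha * 2**delta))
                        if s is not None), None)
            if sol is not None:
                break
            t += 1                            # no solvable system: increment t_i
        gs, hs = sol
        # zip stops at len(gs) = t, so only the t products g_j.h_j are added
        # (h_t is never multiplied); a product of truth tables is a bitwise AND.
        basis = basis + [g & h for g, h in zip(gs, hs)]
        mults += t
    return basis, mults
```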

The execution time of Algorithm 1 is dominated by the calls to a linear-solving procedure (Step 6). The number of trials is in \(O(n\,\alpha \,2^{\delta })\), where the constant in the \(O(\cdot )\) is the average incrementation of \(t_i\) (i.e. the average number of times Step 13 is executed per i). In our experiments, we observed that the optimal value \(t_1 = \psi _{n,\delta } (s_1)\) is rarely enough to get a solvable system for \(f_1\). This is because we start with the minimal basis, as in the single-Boolean-function case. We hence have a few incrementations for \(i=1\). On the other hand, the next optimal \(t_i\)’s are often sufficient or incremented a single time.

We used Algorithm 1 to get the decomposition of various \(n \times n\) s-boxes for \(n \in [\![4,8]\!]\), namely the eight \(4 \times 4\) s-boxes of Serpent [ABK98], the s-boxes \(S_5\) (\(5\times 5\)) and \(S_6\) (\(6\times 6\)) of SC2000 [SYY+02], the \(8 \times 8\) s-boxes \(S_0\) and \(S_1\) of CLEFIA [SSA+07], and the \(8 \times 8\) s-box of Khazad [BR00]. The obtained results are summarized in Table 7. Note that we chose these s-boxes to serve as examples for our decomposition method. Some of them may have a mathematical structure allowing a more efficient decomposition (e.g. the CLEFIA \(S_0\) s-box is based on the inversion over \(\mathbb {F}_{256}\) and can therefore be computed with a 32-multiplication circuit, like the AES s-box).

Table 7. Achieved parameters for several s-boxes.

We observe that Algorithm 1 achieves improved parameters compared to the optimal ones with basis update but without the rank-drop improvement (see Table 5) for \(n \in \{4,5,6\}\). For \(n=8\), we only get parameters close to the optimal ones for the basis update (\(C^*_{n,n}= 62\) instead of 61). This can be explained by the fact that when n increases, the value of \(\delta \) becomes small compared to \(2^n\) and the impact of the exhaustive search is lowered. Thus Algorithm 1 can close the gap and (almost) achieve optimal parameters even in the presence of a minimal starting basis; however, it does not go beyond.

4.4 Parallelization

The proposed decomposition method is highly parallelizable. In practice, most SPN blockciphers have a nonlinear layer applying 16 or 32 s-boxes and most processors are based on a 32-bit or a 64-bit architecture. Therefore we shall focus our study on the k-parallel multiplicative complexity of our method for \(k \in \{2,4\}\).

General Method. In the general method (without improvement) described in Sect. 4.2, the multiplications between the \(g_j\)’s and the \(h_{i,j}\)’s can clearly be processed in parallel. Specifically, they can be done with exactly \(\lceil \frac{m \cdot t}{k} \rceil \) k-multiplications. The multiplications involved in the minimal basis \(\mathcal {B}_0 = \{x\mapsto x^{{u}}, {u}\in \mathcal {U} \}\) can also be fully parallelized at degree \(k=2\) and \(k=4\) for every \(n \ge 4\). In other words, the k-multiplicative complexity for deriving \(\mathcal {B}_0\) equals \(\lceil \frac{r_0}{k} \rceil \) for \(k \in \{2,4\}\) where \(r_0 = C(\mathcal {B}_0) = |\mathcal {B}_0| - (n+1)\) (see Sect. 4.1). One just has to compute \(x^u\) by increasing order of the Hamming weight of \(u \in \mathcal {U}\) (where \(\mathcal {U}\) is the set defined in (19)), then taking the lexicographical order inside a Hamming weight class. As an illustration, the 4-parallel evaluation of \(\mathcal {B}_0\) is given for \(n \in \{4,6,8\}\) in Table 8.

Table 8. Parallel evaluation of \(\mathcal {B}_0\) for \(n \in \{4,6,8\}\).
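The schedule itself is easy to generate; the snippet below (ours) lists the monomials \(x^u\) of \(\mathcal {B}_0\) with \(\mathrm {wt}(u)\ge 2\) (each costing one multiplication) in the prescribed order and groups them k at a time:

```python
def b0_schedule(n, k=4):
    """k-parallel evaluation schedule for B_0: monomials x^u with wt(u) >= 2
    (one multiplication each), sorted by Hamming weight of u, then
    lexicographically, and grouped k at a time. u is bit-packed."""
    l = -(-n // 2)                                 # ceil(n/2)
    pts = range(1 << n)
    U = {x & ((1 << l) - 1) for x in pts} | {x >> l << l for x in pts}
    mons = sorted((u for u in U if bin(u).count("1") >= 2),
                  key=lambda u: (bin(u).count("1"), u))
    return [mons[i:i + k] for i in range(0, len(mons), k)]

print(b0_schedule(4))                  # n = 4: r_0 = 2, one bundle [x1x2, x3x4]
print(len(sum(b0_schedule(8), [])))    # n = 8: r_0 = 22 multiplications
```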

Once all the elements of \(\mathcal {B}_0\) have been computed, and before getting to the multiplicative terms \(g_j \cdot h_{i,j}\), we have to update it to a basis \(\mathcal {B} \supseteq \mathcal {B}_0\) with the target cardinality (see Table 4). This is done by feeding the basis with \(|\mathcal {B}| - |\mathcal {B}_0|\) products of random linear combinations of the current basis. In order to parallelize this step, these new products are generated 4-by-4 from previous elements of the basis. We could validate that, by following such an approach, we still obtain full-rank systems with the achievable parameters given in Table 4. This means that for every \(n \in [\![4,10]\!]\), the k-multiplicative complexity of the general method is \(\lceil \frac{r}{k} \rceil + \lceil \frac{m \cdot t}{k} \rceil \). The obtained results (achievable parameters) are summarized in Table 9.

Table 9. Parallel multiplicative complexity of our general method for an \(n\times n\) s-box.

Improved Method. The parallelization of the improved method is slightly more tricky since all the multiplicative terms \(g_{i,j} \cdot h_{i,j}\) cannot be computed in parallel. Indeed, the resulting products are fed to the basis so that they are potentially involved in the linear combinations producing the next functions \(g_{i+1,j}\), \(h_{i+1,j}\), ..., \(g_{m,j}\), \(h_{m,j}\). In order to fully parallelize our improved method, we customize Algorithm 1 as follows. We keep a counter q of the number of products added to the basis. Each time a new \(f_{i+1}\) is to be decomposed, if the current counter q is not a multiple of k, then the first \(q_0\) products \(g_{i+1,j} \cdot h_{i+1,j}\) are bundled with the last \(q_1\) products \(g_{i,j} \cdot h_{i,j}\) in the parallel version of our improved decomposition, where

$$\begin{aligned} \left\{ \begin{array}{l} q_0 = (k-q) \bmod k\\ q_1 = q \bmod k \end{array}\right. \end{aligned}$$
(35)

We must then ensure that the functions \(\{g_{i+1,j}, h_{i+1,j}\}_{j=0}^{q_0-1}\) are independent of the few last products \(\{g_{i,j} \cdot h_{i,j}\}_{j=t_i-q_1}^{t_i-1}\). This comes at no cost for the \(g_{i+1,j}\)’s, which can be generated without the last \(q_1\) products, and it adds a constraint on the linear system for the first \(q_0\) searched \(h_{i+1,j}\) functions. However, we observed in our experiments that for small values of k such as \(k \in \{2,4\}\), this constraint has a negligible impact on Algorithm 1. We could actually obtain the exact same parameters as in Table 7 for all the tested s-boxes (Serpent, SC2000, CLEFIA, and Khazad) for a parallelization degree of \(k=2\), except for the s-box \(S_3\) of Serpent, which requires one more multiplication.

5 Implementations

This section describes our implementations of a bitsliced s-box layer protected with higher-order masking based on our decomposition method. Our implementations evaluate 16 \(n\times n\) s-boxes in parallel where \(n\in \{4,8\}\), and they are developed in generic 32-bit ARM assembly. They take n input sharings \([{\varvec{x}}_1]\), \([{\varvec{x}}_2]\), ..., \([{\varvec{x}}_n]\) defined as

$$\begin{aligned}{}[{\varvec{x}}_i] = ({\varvec{x}}_{i,1},{\varvec{x}}_{i,2},\dots ,{\varvec{x}}_{i,d}) ~~\text {such that}~~\sum _{j=1}^d {\varvec{x}}_{i,j} = {\varvec{x}}_i \end{aligned}$$
(36)

where \({\varvec{x}}_i\) is a 16-bit register containing the i-th bit of the 16 s-box inputs. Our implementations then output n sharings \([{\varvec{y}}_1]\), \([{\varvec{y}}_2]\), ..., \([{\varvec{y}}_n]\) corresponding to the bitsliced output bits of the s-box. Since we are on a 32-bit architecture with 16-bit bitsliced registers, we use a degree-2 parallelization for the multiplications. Namely, the 16-bit ANDs are packed by pairs and replaced by 32-bit ANDs which are applied on shares using the ISW scheme, as explained in [GR16].

The computation is then done in three stages. First, we need to construct the shares of the elements of the minimal basis \(\mathcal {B}_0\), specifically \([{\varvec{x}}^{u}]\) for every \(u \in \mathcal {U}\), where \({\varvec{x}}^{u}\) denotes the bitsliced register for the bit \(x^u\), and where \(\mathcal {U}\) is the set defined in (19). This first stage requires \(r_0/2\) 32-bit ISW-ANDs, where \(r_0 = 2\) for \(n=4\) and \(r_0 = 22\) for \(n=8\) (see Table 8).

Once the first stage is completed, all the remaining multiplications are done between linear combinations of the elements of the basis. Let us denote by \([{\varvec{t}}_i]\) the sharings corresponding to the elements of the basis which are stored in memory. After the first stage we have \(\{[{\varvec{t}}_i]\} = \{[{\varvec{x}}^{u}] \mid u \in \mathcal {U}\}\). Each new \({\varvec{t}}_i\) is defined as

$$\begin{aligned} \Big (\sum _{j< i} a_{i,j} {\varvec{t}}_j\Big ) \odot \Big (\sum _{j<i} b_{i,j} {\varvec{t}}_j\Big ) \end{aligned}$$
(37)

where \(\odot \) denotes the bitwise multiplication, and where \(\{a_{i,j}\}_j\) and \(\{b_{i,j}\}_j\) are the binary coefficients obtained from the s-box decomposition (namely the coefficients of the functions \(g_{i,j}\) and \(h_{i,j}\) in the span of the basis). The second stage hence consists in a loop over the remaining multiplications that:

  1. computes the linear-combination sharings \([{\varvec{r}}_i] = \sum _{j < i} a_{i,j} [{\varvec{t}}_j]\) and \([{\varvec{s}}_i] = \sum _{j < i} b_{i,j} [{\varvec{t}}_j]\),

  2. refreshes the sharing \([{\varvec{r}}_i]\),

  3. computes the sharing \([{\varvec{t}}_i]\) such that \({\varvec{t}}_i = {\varvec{r}}_i \odot {\varvec{s}}_i\),

where the last step is performed for two successive values of i at the same time by a call to a 32-bit ISW-AND (a sketch of these steps is given below). The sums in Step 1 are performed on each share independently. The necessity of the refreshing procedure in Step 2 is explained in [GR16]: an ISW multiplication of two linear combinations of the same sharings can introduce a security flaw (see for instance [CPRR14]). As in [GR16], this refreshing is implemented from an ISW multiplication with \((1, 0, 0, \ldots , 0)\).
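As a rough Python counterpart of these three steps (ours, reusing the `isw_and` and `MASK` from the sketch in Sect. 2.3; the actual code is 32-bit ARM assembly):

```python
def linear_comb(coeffs, sharings):
    """Step 1: share-wise XOR of the selected [t_j]'s (d XORs per term)."""
    acc = [0] * len(sharings[0])
    for a_bit, t in zip(coeffs, sharings):
        if a_bit:
            acc = [x ^ y for x, y in zip(acc, t)]
    return acc

def refresh(shares):
    """Step 2: refresh via an ISW multiplication with the constant sharing
    (1, 0, ..., 0), i.e. the all-ones word in bitslice, as in [GR16]."""
    one = [MASK] + [0] * (len(shares) - 1)
    return isw_and(shares, one)

# Step 3 (sketch): one new basis sharing [t_i]; in the real implementation
# two such products are packed into a single 32-bit ISW-AND.
# t_sharings.append(isw_and(refresh(linear_comb(a_i, t_sharings)),
#                           linear_comb(b_i, t_sharings)))
```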

Once all the basis sharings \([{\varvec{t}}_i]\) have been computed, the third stage simply consists in deriving each output sharing \([{\varvec{y}}_i]\) as a linear combination of the \([{\varvec{t}}_i]\), which is refreshed before being returned.

We compare our results with the optimized implementations from [GR16] of the CRV method [CRV14] and the algebraic decomposition (AD) method [CPRR15]. These implementations compute four s-boxes in parallel for \(n=8\) and eight s-boxes in parallel for \(n=4\) on a 32-bit ARM architecture. Table 10 summarizes the obtained performances in clock cycles with respect to the masking order d. It is worth noticing that packing and unpacking the bitslice registers for the parallelization of the ISW-ANDs implies a linear overhead in d. For \(d \in [\![2,20]\!]\), this overhead is between \(4\,\%\) and \(6\,\%\) of the overall s-box computations for \(n=8\), and between \(7\,\%\) and \(11\,\%\) for \(n=4\) (and this ratio is asymptotically negligible). For \(d=2\), the overhead slightly exceeds the gain, but for every \(d \ge 3\), parallelizing the ISW-ANDs always results in an overall performance gain.

Table 10. Performances in clock cycles.

We observe that our implementations are asymptotically faster than the optimized implementations of the CRV and AD methods (3.6 times faster for \(n=8\) and 5.7 times faster for \(n=4\)). However, we also see that the linear coefficient is significantly greater for our implementations, which comes from the computation of the linear combinations in input of the ISW-ANDs (i.e. the sharings \([{\varvec{r}}_i]\) and \([{\varvec{s}}_i]\)). As an illustration, Figs. 1 and 2 plot the obtained timings with respect to d. We see that for \(n=4\), our implementation is always faster than the optimized AD and CRV. On the other hand, for \(n=8\), our implementation is slightly slower for \(d \le 8\). We stress that our implementations could probably be improved by optimizing the computation of the linear combinations.

Fig. 1. Timings for \(n=8\).

Fig. 2. Timings for \(n=4\).

The RAM consumption and code size of our implementations are given in Table 11 and compared to those of the CRV and AD implementations from [GR16]. We believe these memory requirements to be affordable for not-too-constrained embedded devices. In terms of code size, our implementations are always the most compact. This is especially significant for \(n=8\), where CRV and AD need a large amount of storage for the lookup tables of the linearized polynomials (see [GR16]). On the other hand, we observe a big gap between our implementations and those from [GR16] regarding RAM consumption. Our method indeed consumes more RAM because all the \([{\varvec{t}}_i]\) sharings must be stored, while such a large basis is not required for the CRV and AD methods, and because of some optimizations in the computation of the linear combinations (see the full version).

Table 11. Code sizes and RAM consumptions.