
1 Introduction

Side-Channel Attacks. Side-channel attacks belong to the genre of implementation attacks and exploit the fact that a device performing a cryptographic algorithm leaks information related to the secret key through certain physical phenomena such as execution time, power consumption, EM radiation, etc. Depending on the source of the information leakage and the required post-processing, one can distinguish different categories of side-channel attacks, e.g. timing attacks, Simple Power Analysis (SPA) attacks, and Differential Power Analysis (DPA) attacks [KJJ99]. Timing attacks use data-dependent (i.e. plaintext-dependent) variations in the execution time of a cryptographic algorithm to deduce information about the secret key involved in the computation of the ciphertext. In contrast, power analysis attacks require the attacker to measure the power consumption of a device while it executes a cryptographic algorithm [MOP07]. To perform an SPA attack, the attacker typically collects only one (or very few) power trace(s) and attempts to recover the secret key by focusing on differences between patterns within a trace. A DPA attack, on the other hand, requires many power traces and employs sophisticated statistical techniques to analyze differences between the traces [MOP07].

Even though DPA was first described using the DES algorithm as an example, it soon became clear that power analysis attacks can also be applied to break other secret-key algorithms (e.g. AES) as well as public-key algorithms (e.g. RSA). A DPA attack normally exploits the principle of divide and conquer, which is possible since most block ciphers use the secret key only partially at any given point in time. Hence, the attacker can recover one part of the key at a time by studying the relationship between the actual power consumption and estimated power values derived from a theoretical model of the device. During the past 15 years, dozens of papers about successful DPA attacks on different implementations (hardware, software) of numerous secret-key cryptosystems (block ciphers, stream ciphers, keyed-hash message authentication codes) have been published. The experiments described in these papers confirm the real-world impact of DPA attacks in the sense that unprotected (or insufficiently protected) implementations of cryptographic algorithms can be broken in relatively short time using relatively cheap equipment.

The vast number of successful DPA attacks reported in the literature has initiated a large body of research on countermeasures. From a high-level point of view, countermeasures against DPA attacks can be divided into hiding (i.e. decreasing the signal-to-noise ratio) and masking (i.e. randomizing all the sensitive data) [MOP07]. Approaches to hiding-style countermeasures attempt to “equalize” the power consumption profile (i.e. making the power consumption invariant for all possible values of the secret key) or to randomize the power consumption so that a profile can no longer be correlated to any secret information. Masking, on the other hand, conceals every key-dependent intermediate result with a random value, the so-called mask, in order to break the dependency between the sensitive variable (i.e. involving the secret key) and the power consumption.

The Masking Countermeasure. Though masking is often considered to be less efficient (in terms of execution time) than hiding, it provides the key benefit that one can formally prove its security under certain assumptions on the device leakage model and the attacker’s capabilities. The way masking is applied depends on the concrete operations executed by a cipher. In general, logical operations (e.g. XOR, Shift, etc.) are protected using Boolean masking, whereas additions/subtractions and multiplications require arithmetic and multiplicative masking, respectively. When a cryptographic algorithm involves a combination of these operations, it becomes necessary to convert the masks from one form to the other in order to get the correct result. Examples of algorithms that perform both arithmetic (e.g. modular addition) and logical operations include two SHA-3 finalists (namely Blake and Skein) as well as all four stream ciphers in the eSTREAM software portfolio. Also, ARX-based block ciphers (e.g. XTEA [NW97] and Threefish) and the hash functions SHA-1 and SHA-2 fall into this category. From a design point of view, modular addition provides the essential non-linearity at high throughput, and is hence used in several lightweight block ciphers, e.g. SPECK [BSS+13]. Therefore, techniques for conversion between Boolean and arithmetic masks are of significant practical importance.

Conversion Between Boolean and Arithmetic Masking. At CHES 2001, Goubin described a very elegant algorithm for converting from Boolean masking to arithmetic masking, with only a constant number of operations, independent of the addition bit size k. Goubin also described an algorithm for converting from arithmetic to Boolean masking, but with \(\mathcal{O}(k)\) operations. A different arithmetic to Boolean conversion algorithm was later described in [CT03], based on precomputed tables; an extension was described in [NP04] to reduce the memory consumption. At CHES 2012, Debraize described a modification of the table-based conversion in [CT03], correcting a bug and improving the running time, still with asymptotic complexity \(\mathcal{O}(k)\).

Karroumi et al. recently noticed in [KRJ14] that Goubin’s recursion formula for converting from arithmetic to Boolean masking can also be used to compute an arithmetic addition \(z=x+y {\text { mod }}2^k\) directly with masked shares \(x=x_1 \oplus x_2\) and \(y=y_1 \oplus y_2\). The advantage of this method is that one does not need the three-step process of converting x and y from Boolean to arithmetic masking, performing the addition with arithmetic masks, and then converting back from arithmetic to Boolean masks. The authors showed that this can lead to better performance in practice for the block cipher XTEA. However, as their algorithm is based on Goubin’s recursion formula, its complexity is still \(\mathcal{O}(k)\).

Conversion algorithms have recently been extended to higher-order countermeasures in [CGV14], based on Goubin’s conversion method. For security against any attack of order t, their solution has time complexity \(\mathcal{O}(n^2 \cdot k)\) for \(n=2t+1\) shares.

New Algorithms with Logarithmic Complexity. In this paper we describe a new algorithm for converting from arithmetic to Boolean masking with complexity \(\mathcal{O}(\log k)\) instead of \(\mathcal{O}(k)\). Our algorithm is based on the Kogge-Stone carry look-ahead adder [KS73], which computes the carry signal in \(\mathcal{O}(\log k)\) instead of \(\mathcal{O}(k)\) for the classical ripple carry adder. Following [BN05] and [KRJ14] we also describe a variant algorithm for performing arithmetic addition modulo \(2^k\) directly on Boolean shares, with complexity \(\mathcal{O}(\log k)\) instead of \(\mathcal{O}(k)\). We prove the security of our new algorithms against first-order attacks.

We also provide implementation results for our algorithms along with existing algorithms on a 32-bit microcontroller. Our results show that the new algorithms perform better than Goubin’s algorithm for \(k \ge 32\), as we obtain \(14\,\%\) improvement in execution time for \(k=32\), and \(23\,\%\) improvement for \(k=64\). We also describe our results for first-order secure implementations of HMAC-SHA-1 (\(k=32\)) and of the SPECK block-cipher (\(k=64\)).

2 Goubin’s Algorithms

In this section we first recall Goubin’s algorithm for converting from Boolean masking to arithmetic masking and conversely [Gou01], secure against first-order attacks. Given a k-bit variable x, for Boolean masking we write:

$$ x=x' \oplus r$$

where \(x'\) is the masked variable and \(r \leftarrow \{0,1\}^k\). Similarly for arithmetic masking we write

$$ x=A+ r {\text { mod }}2^k$$

In the following all additions and subtractions are done modulo \(2^k\), for some parameter k.

The goal of the paper is to describe efficient conversion algorithms between Boolean and arithmetic masking, secure against first-order attacks. Given \(x'\) and r, one should compute the arithmetic mask \(A=(x' \oplus r)-r {\text { mod }}2^k\) without leaking information about \(x=x' \oplus r\). In particular, one cannot compute A directly, as this would leak information about the sensitive variable x; instead, all intermediate variables must be properly randomized so that no information is leaked about x. Similarly, given A and r, one must compute the Boolean mask \(x'=(A+r) \oplus r\) without leaking information about \(x=A+r\).

2.1 Boolean to Arithmetic Conversion

We first recall the Boolean to arithmetic conversion algorithm from Goubin [Gou01]. One considers the following function \(\varPsi _{x'}(r): {\mathbb F}_{2^k}\rightarrow {\mathbb F}_{2^k}\):

$$ \varPsi _{x'}(r) = (x' \oplus r)-r$$

Theorem 1

(Goubin [Gou01]). The function \( \varPsi _{x'}(r) = (x' \oplus r) -r \) is affine over \({\mathbb F}_2\).

Using this affine property, the conversion from Boolean to arithmetic masking is straightforward. Given \(x',r \in {\mathbb F}_{2^k}\) we must compute A such that \(x' \oplus r=A+r\). From the affine property of \( \varPsi _{x'}(r)\) we can write:

$$ A=(x' \oplus r)-r=\varPsi _{x'}(r)=\varPsi _{x'}(r \oplus r_2) \oplus \big (\varPsi _{x'}(r_2) \oplus \varPsi _{x'}(0) \big )$$

for any \(r_2 \in {\mathbb F}_{2^k}\). Therefore the technique consists in first generating a uniformly distributed random \(r_2\) in \({\mathbb F}_{2^k}\), then computing \(\varPsi _{x'}(r \oplus r_2)\) and \(\varPsi _{x'}(r_2) \oplus \varPsi _{x'}(0)\) separately, and finally XORing these two terms to obtain A. The technique is clearly secure against first-order attacks: namely, the left term \(\varPsi _{x'}(r \oplus r_2)\) is independent from r and therefore from \(x=x' \oplus r\), and the right term \(\varPsi _{x'}(r_2) \oplus \varPsi _{x'}(0)\) is also independent from r and therefore from x. Note that the technique is very efficient, as it requires only a constant number of operations (independent of k).
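As an illustration, the conversion above can be sketched in C for \(k=32\). The fresh random \(r_2\) is passed as a parameter here for clarity; in a real implementation it must be drawn from a suitable random number generator for every conversion.

```c
#include <stdint.h>

/* Goubin's affine function Psi_{x'}(r) = (x' ^ r) - r (mod 2^32). */
static uint32_t psi(uint32_t xp, uint32_t r) {
    return (xp ^ r) - r;
}

/* Boolean-to-arithmetic conversion: given x' and r with x = x' ^ r,
 * return A such that x = A + r (mod 2^32). Since Psi_{x'} is affine
 * over F_2, Psi(r) = Psi(r ^ r2) ^ Psi(r2) ^ Psi(0); with r2 fresh and
 * uniform, neither term below depends on r alone. */
uint32_t bool_to_arith(uint32_t xp, uint32_t r, uint32_t r2) {
    uint32_t left  = psi(xp, r ^ r2);           /* Psi_{x'}(r ^ r2) */
    uint32_t right = psi(xp, r2) ^ psi(xp, 0);  /* Psi_{x'}(r2) ^ Psi_{x'}(0) */
    return left ^ right;                        /* = Psi_{x'}(r) = A */
}
```

The XOR of the two terms equals \(\varPsi _{x'}(r)=A\), while each term on its own is independent of r.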

2.2 From Arithmetic to Boolean Masking

Goubin also described in [Gou01] a technique for converting from arithmetic to Boolean masking, secure against first-order attacks. However, it is more complex than the Boolean-to-arithmetic conversion; its complexity is \(\mathcal{O}(k)\) for additions modulo \(2^k\). It is based on the following theorem.

Theorem 2

(Goubin [Gou01]). If we denote \(x'=(A+r) \oplus r\), we also have \(x'=A \oplus u_{k-1}\), where \(u_{k-1}\) is obtained from the following recursion formula:

$$\begin{aligned} \left\{ \begin{array}{l} u_0=0 \\ \forall i \ge 0,~u_{i+1}=2 [u_i \wedge (A \oplus r) \oplus (A \wedge r) ] \end{array} \right. \end{aligned}$$
(1)

Since the iterative computation of \(u_i\) contains only XOR and AND operations, it can easily be protected against first-order attacks. We refer to Appendix A for the full conversion algorithm.
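Before masking, the recursion itself can be checked directly. The following C sketch (for \(k=32\)) is the unprotected recursion, equivalently computing \(A+r=A \oplus r \oplus u_{k-1}\); the first-order secure version, with every intermediate masked, is the one given in Appendix A.

```c
#include <stdint.h>

/* Unprotected check of Goubin's recursion (1) for k = 32:
 * u_0 = 0, u_{i+1} = 2*[(u_i & (A ^ r)) ^ (A & r)],
 * and after k-1 iterations A + r = A ^ r ^ u_{k-1} (mod 2^32). */
uint32_t goubin_add(uint32_t A, uint32_t r) {
    uint32_t u = 0;
    for (int i = 0; i < 31; i++)         /* k - 1 = 31 iterations */
        u = 2u * ((u & (A ^ r)) ^ (A & r));
    return A ^ r ^ u;                    /* = A + r (mod 2^32) */
}
```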

3 A New Recursive Formula Based on Kogge-Stone Adder

Our new conversion algorithm is based on the Kogge-Stone adder [KS73], a carry look-ahead adder that generates the carry signal in \(\mathcal{O}(\log k)\) time, when addition is performed modulo \(2^k\). In this section we first recall the classical ripple-carry adder, which generates the carry signal in \(\mathcal{O}(k)\) time, and we show how Goubin’s recursion formula (1) can be derived from it. The derivation of our new recursion formula from the Kogge-Stone adder will proceed similarly.

3.1 The Ripple-Carry Adder and Goubin’s Recursion Formula

We first recall the classical ripple-carry adder. Given three bits x, y and c, the carry \(c'\) for \(x+y+c\) can be computed as \(c'=(x \wedge y) \oplus (x \wedge c) \oplus (y \wedge c)\). Therefore, the modular addition of two k-bit variables x and y can be defined recursively as follows:

$$\begin{aligned} (x+y)^{(i)} = x^{(i)} \oplus y^{(i)} \oplus c^{(i)} \end{aligned}$$
(2)

for \(0 \le i <k\), where

$$\begin{aligned} \left\{ \begin{array}{l} c^{(0)} = 0 \\ \forall i \ge 1,\,c^{(i)} = (x^{(i-1)} \wedge y^{(i-1)}) \oplus (x^{(i-1)} \wedge c^{(i-1)}) \oplus (c^{(i-1)} \wedge y^{(i-1)}) \end{array} \right. \end{aligned}$$
(3)

where \( x^{(i)}\) represents the \(i^{\text {th}}\) bit of the variable x, with \(x^{(0)}\) being the least significant bit.

In the following, we show how recursion (3) can be computed directly with k-bit values instead of bits, which enables us to recover Goubin’s recursion (1). For this, we define the sequences \(x_j\), \(y_j\) and \(v_j\) whose \(j+1\) least significant bits are the same as x, y and c respectively:

$$\begin{aligned} x_j=\bigoplus \limits _{i=0}^j 2^ix^{(i)},~~ y_j=\bigoplus \limits _{i=0}^j 2^iy^{(i)},~~ v_j=\bigoplus \limits _{i=0}^j 2^ic^{(i)} \end{aligned}$$
(4)

for \(0 \le j \le k-1\). Since \(c^{(0)}=0\) we can actually start the summation for \(v_j\) at \(i=1\); we get from (3):

$$\begin{aligned} v_{j+1}&=\bigoplus \limits _{i=1}^{j+1} 2^i c^{(i)}\\ &=\bigoplus \limits _{i=1}^{j+1} 2^i \left( (x^{(i-1)} \wedge y^{(i-1)}) \oplus (x^{(i-1)} \wedge c^{(i-1)}) \oplus (c^{(i-1)} \wedge y^{(i-1)}) \right) \\ &=2\bigoplus \limits _{i=0}^{j} 2^{i} \left( (x^{(i)} \wedge y^{(i)}) \oplus (x^{(i)} \wedge c^{(i)}) \oplus (c^{(i)} \wedge y^{(i)}) \right) \\ &=2 \big ((x_j \wedge y_j) \oplus (x_j \wedge v_j) \oplus (y_j \wedge v_j)\big ) \end{aligned}$$

which gives the recursive equation:

$$\begin{aligned} \left\{ \begin{array}{ll} v_0=0 \\ \forall j \ge 0,~v_{j+1}=2 \left( v_j \wedge (x_j \oplus y_j) \oplus (x_j \wedge y_j) \right) \end{array} \right. \end{aligned}$$
(5)

Therefore we have obtained a recursion similar to (3), but with k-bit values instead of single bits. Note that from the definition of \(v_j\) in (4) the variables \(v_j\) and \(v_{j+1}\) have the same least significant bits from bit 0 to bit j, which is not immediately obvious when considering only recursion (5). Combining (2) and (4) we obtain \(x_j+y_j=x_j \oplus y_j \oplus v_j\) for all \(0 \le j \le k-1\). For k-bit values x and y, we have \(x=x_{k-1}\) and \(y=y_{k-1}\), which gives:

$$ x+y=x \oplus y \oplus v_{k-1}$$

We now define the same recursion as (5), but with constant x, y instead of \(x_j\), \(y_j\). That is, we let

$$\begin{aligned} \left\{ \begin{array}{ll} u_0=0\\ \forall j \ge 0,~u_{j+1}=2 \left( u_j \wedge (x \oplus y ) \oplus (x \wedge y) \right) \end{array} \right. \end{aligned}$$
(6)

which is exactly the same recursion as Goubin’s recursion (1). It is easy to show inductively that the variables \(u_j\) and \(v_j\) have the same least significant bits, from bit 0 to bit j. Let us assume that this is true for \(u_j\) and \(v_j\). From recursions (5) and (6) we have that the least significant bits of \(v_{j+1}\) and \(u_{j+1}\) from bit 0 to bit \(j+1\) only depend on the least significant bits from bit 0 to bit j of \(v_j\), \(x_j\) and \(y_j\), and of \(u_j\), x and y respectively. Since these are the same, the induction is proved.

Eventually for k-bit registers we have \(u_{k-1}=v_{k-1}\), which proves Goubin’s recursion formula (1), namely:

$$ x+y =x \oplus y \oplus u_{k-1}$$

As mentioned previously, this recursion formula requires \(k-1\) iterations on k-bit registers. In the following, we describe an improved recursion based on the Kogge-Stone carry look-ahead adder, requiring only \(\log _2 k\) iterations.

3.2 The Kogge-Stone Carry Look-Ahead Adder

In this section we first recall the general solution from [KS73] for first-order recurrence equations; the Kogge-Stone carry look-ahead adder is a direct application.

General First-Order Recurrence Equation. We consider the following recurrence equation:

$$\begin{aligned} \left\{ \begin{array}{l} z_0=b_0 \\ \forall i \ge 1,~z_i=a_{i} z_{i-1} + b_{i} \end{array} \right. \end{aligned}$$
(7)

We define the function \(Q(m,n)\) for \(m \ge n\):

$$\begin{aligned} Q(m,n)=\sum \limits _{j=n}^m \left( \prod _{i=j+1}^{m} a_i \right) b_j \end{aligned}$$
(8)

We have \(Q(0,0)=b_0=z_0\), \(Q(1,0)=a_1 b_0+b_1=z_1\), and more generally:

$$\begin{aligned} Q(m,0)&=\sum \limits _{j=0}^{m-1} \left( \prod _{i=j+1}^{m} a_i \right) b_j + b_m \\ &=a_m \sum \limits _{j=0}^{m-1} \left( \prod _{i=j+1}^{m-1} a_i \right) b_j+b_m=a_m Q(m-1,0)+b_m \end{aligned}$$

Therefore the sequence Q(m, 0) satisfies the same recurrence as \(z_m\), which implies \(Q(m,0)=z_m\) for all \(m\ge 0\). Moreover we have:

$$\begin{aligned} Q(2m-1,0)&=\sum \limits _{j=0}^{2m-1} \left( \prod _{i=j+1}^{2m-1} a_i \right) b_j\\ &=\left( \prod \limits _{j=m}^{2m-1} a_j\right) \sum \limits _{j=0}^{m-1} \left( \prod _{i=j+1}^{m-1} a_i \right) b_j+\sum \limits _{j=m}^{2m-1} \left( \prod _{i=j+1}^{2m-1} a_i \right) b_j \end{aligned}$$

which gives the recursive doubling equation:

$$Q(2m-1,0) = \left( \prod \limits _{j=m}^{2m-1} a_j\right) Q(m-1,0)+Q(2m-1,m) $$

where the two terms \(Q(m-1,0)\) and \(Q(2m-1,m)\) each contain only m terms \(a_i\) and \(b_i\), instead of 2m in \(Q(2m-1,0)\). Therefore the two terms can be computed in parallel. This is also the case for the product \(\prod _{j=m}^{2m-1} a_j\), which can be computed with a product tree. Therefore, by recursive splitting with N processors, the sequence element \(z_N\) can be computed in time \(\mathcal{O}(\log _2 N)\), instead of \(\mathcal{O}(N)\) with a single processor.

The Kogge-Stone Carry Look-Ahead Adder.

The Kogge-Stone carry look-ahead adder [KS73] is a direct application of the previous technique. Namely writing \(c_i=c^{(i)}\), \(a_i=x^{(i)} \oplus y^{(i)}\) and \(b_i=x^{(i)} \wedge y^{(i)}\) for all \(i \ge 0\), we obtain from (3) the recurrence relation for the carry signal \(c_i\):

$$ \left\{ \begin{array}{l} c_0=0\\ \forall i \ge 1,~c_i= (a_{i-1} \wedge c_{i-1}) \oplus b_{i-1} \end{array} \right. $$

which is similar to (7), where \(\wedge \) is the multiplication and \(\oplus \) the addition. We can therefore compute the carry signal \(c_i\) for \(0 \le i<k\) in time \(\mathcal{O}(\log k)\) instead of \(\mathcal{O}(k)\).

More precisely, the Kogge-Stone carry look-ahead adder can be defined as follows. For all \(0 \le j <k\) one defines the sequence of bits:

$$\begin{aligned} P_{0,j} = x^{(j)} \oplus y^{(j)},~~~ G_{0,j} = x^{(j)} \wedge y^{(j)} \end{aligned}$$
(9)

and the following recursive equations:

$$\begin{aligned} \left\{ \begin{array}{rcl} P_{i,j} &{} = &{} P_{i-1,j} \wedge P_{i-1,j-2^{i-1}} \\ G_{i,j} &{} = &{} (P_{i-1,j} \wedge G_{i-1,j-2^{i-1}}) \oplus G_{i-1,j} \end{array} \right. \end{aligned}$$
(10)

for \(2^{i-1} \le j < k\), and \(P_{i,j}=P_{i-1,j}\) and \(G_{i,j}=G_{i-1,j}\) for \(0 \le j<2^{i-1}\). The following lemma shows that the carry signal \(c_j\) can be computed from the sequence \(G_{i,j}\).

Lemma 1

We have \((x+y)^{(j)}= x^{(j)} \oplus y^{(j)} \oplus c_{j}\) for all \(0 \le j <k\) where the carry signal \(c_j\) is computed as \(c_0=0\), \(c_1=G_{0,0}\) and \( c_{j+1}=G_{i,j}\) for \(2^{i-1} \le j < 2^i\).

To compute the carry signal up to \(c_{k-1}\), one must therefore compute the sequences \(P_{i,j}\) and \(G_{i,j}\) up to \(i=\lceil \log _2 (k-1) \rceil \). For completeness we provide the proof of Lemma 1 in Appendix B.

3.3 Our New Recursive Algorithm

We now derive a recursion formula with k-bit variables instead of single bits; we proceed as in Sect. 3.1, using the more efficient Kogge-Stone carry look-ahead algorithm, instead of the classical ripple-carry adder for Goubin’s recursion. We prove the following theorem, analogous to Theorem 2, but with complexity \(\mathcal{O}(\log k)\) instead of \(\mathcal{O}(k)\). Given a variable x, we denote by \(x \ll \ell \) the variable x left-shifted by \(\ell \) bits, keeping only k bits in total.

Theorem 3

Let \(x,y \in \{0,1\}^k\) and \(n=\lceil \log _2 (k-1) \rceil \). Define the sequence of k-bit variables \(P_i\) and \(G_i\), with \(P_0=x \oplus y\) and \(G_0= x \wedge y\), and

$$\begin{aligned} \left\{ \begin{array}{rcl} P_i &{} = &{} P_{i-1} \wedge (P_{i-1} \ll 2^{i-1}) \\ G_i &{} = &{} \left( P_{i-1} \wedge (G_{i-1} \ll 2^{i-1}) \right) \oplus G_{i-1} \end{array} \right. \end{aligned}$$
(11)

for \(1 \le i \le n\). Then \( x+y =x \oplus y \oplus (2G_n)\).

Proof

We start from the sequences \(P_{i,j}\) and \(G_{i,j}\) defined in Sect. 3.2 corresponding to the Kogge-Stone carry look-ahead adder, and we proceed as in Sect. 3.1. We define the variables:

$$P_i := \sum \limits _{j=2^i-1}^{k-1} 2^j P_{i,j} \quad G_i := \sum \limits _{j=0}^{k-1} 2^j G_{i,j} $$

which from (9) gives the initial condition \(P_0=x \oplus y\) and \(G_0= x \wedge y\), and using (10):

$$\begin{aligned} P_{i}&=\sum \limits _{j=2^i-1}^{k-1} 2^j P_{i,j}=\sum \limits _{j=2^{i}-1}^{k-1} 2^j (P_{i-1,j} \wedge P_{i-1,j-2^{i-1}}) \\ &=\left( \sum \limits _{j=2^i-1}^{k-1} 2^j P_{i-1,j} \right) \wedge \left( \sum \limits _{j=2^i-1}^{k-1} 2^j P_{i-1,j-2^{i-1}} \right) \end{aligned}$$

We can start the summation of the \(P_{i-1,j}\) bits in the first factor with \(j=2^{i-1}-1\) instead of \(2^i-1\), because the second summation still starts with \(j=2^{i}-1\), hence the corresponding additional bits are ANDed with 0. This gives:

$$\begin{aligned} P_i&=\left( \sum \limits _{j=2^{i-1}-1}^{k-1} 2^j P_{i-1,j} \right) \wedge \left( \sum \limits _{j=2^i-1}^{k-1} 2^j P_{i-1,j-2^{i-1}} \right) \\ &=P_{i-1} \wedge \left( \sum \limits _{j=2^{i-1}-1}^{k-1-2^{i-1}} 2^{j+2^{i-1}} P_{i-1,j} \right) =P_{i-1} \wedge (P_{i-1} \ll 2^{i-1}) \end{aligned}$$

Hence we get the same recursion formula for \(P_i\) as in (11). Similarly we have using (10):

$$\begin{aligned} G_i&=\sum \limits _{j=0}^{k-1} 2^j G_{i,j}=\sum \limits _{j=2^{i-1}}^{k-1} 2^j \left( (P_{i-1,j} \wedge G_{i-1,j-2^{i-1}}) \oplus G_{i-1,j} \right) + \sum \limits _{j=0}^{2^{i-1}-1} 2^j G_{i-1,j} \\ &=\left( \sum \limits _{j=2^{i-1}}^{k-1} 2^j \left( P_{i-1,j} \wedge G_{i-1,j-2^{i-1}}\right) \right) \oplus G_{i-1} \\ &=\left( P_{i-1} \wedge (G_{i-1} \ll 2^{i-1}) \right) \oplus G_{i-1} \end{aligned}$$

Therefore we obtain the same recurrence for \(P_i\) and \(G_i\) as (11). Since from Lemma 1 we have that \(c_{j+1}=G_{i,j}\) for all \(2^{i-1} \le j <2^i\), and \(G_{i,j}=G_{i-1,j}\) for \(0 \le j <2^{i-1}\), we obtain \(c_{j+1}=G_{i,j}\) for all \(0 \le j <2^i\). Taking \(i=n=\lceil \log _2 (k-1) \rceil \), we obtain \(c_{j+1}=G_{n,j}\) for all \(0 \le j \le k-2<k-1 \le 2^n\). This implies:

$$ \sum \limits _{j=0}^{k-1} 2^jc_j=\sum \limits _{j=1}^{k-1} 2^jc_j=2\sum \limits _{j=0}^{k-2} 2^jc_{j+1}=2\sum \limits _{j=0}^{k-2} 2^jG_{n,j}=2G_n$$

Since from Lemma 1 we have \( (x+y)^{(j)}=x^{(j)} \oplus y^{(j)} \oplus c_j\) for all \(0 \le j <k\), this implies \(x+y=x \oplus y \oplus (2G_n)\) as required. \(\square \)

The complexity of the previous recursion is only \(\mathcal{O}(\log k)\), as opposed to \(\mathcal{O}(k)\) with Goubin’s recursion. The sequence can be computed using the algorithm below; note that we do not compute the last element \(P_n\), since it is not used in the computation of \(G_n\). Note also that the algorithm below could be used as an \(\mathcal{O}(\log k)\) implementation of arithmetic addition \(z=x+ y {\text { mod }}{2^k}\) on processors offering only Boolean operations.

[Algorithm 1: Kogge-Stone Addition]
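In C, for \(k=32\) (so \(n=\lceil \log _2 31 \rceil =5\)), the unmasked Kogge-Stone addition of Theorem 3 can be sketched as follows; the loop structure mirrors recursion (11), though minor details may differ from Algorithm 1 as given in the original figure.

```c
#include <stdint.h>

/* Kogge-Stone addition (recursion (11)) for k = 32, n = 5:
 * computes x + y (mod 2^32) using Boolean operations only. */
uint32_t ks_add(uint32_t x, uint32_t y) {
    uint32_t P = x ^ y;                 /* P_0 */
    uint32_t G = x & y;                 /* G_0 */
    for (int i = 1; i <= 4; i++) {      /* iterations 1 .. n-1 */
        uint32_t pow = 1u << (i - 1);   /* shift amount 2^{i-1} */
        G = (P & (G << pow)) ^ G;       /* G_i, uses P_{i-1} */
        P = P & (P << pow);             /* P_i */
    }
    G = (P & (G << 16)) ^ G;            /* last step i = n: P_n not needed */
    return x ^ y ^ (2u * G);            /* Theorem 3: x + y = x ^ y ^ 2G_n */
}
```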

4 Our New Conversion Algorithm

Our new conversion algorithm from arithmetic to Boolean masking is a direct application of the Kogge-Stone adder in Algorithm 1. We are given as input two arithmetic shares A, r of \(x = A + r {\text { mod }}{2^k}\), and we must compute \(x'\) such that \(x= x' \oplus r\), without leaking information about x.

Since Algorithm 1 only contains Boolean operations, it is easy to protect against first-order attacks. Assume that we give as input the two arithmetic shares A and r to Algorithm 1; the algorithm first computes \(P=A \oplus r\) and \(G=A \wedge r\), and after n iterations outputs \(x=A+r=A \oplus r \oplus (2G)\). Obviously one cannot compute \(P=A \oplus r\) and \(G=A \wedge r\) directly since that would reveal information about the sensitive variable \(x=A+r\). Instead we protect all intermediate variables with a random mask s using standard techniques, that is we only work with \(P'=P \oplus s\) and \(G'=G \oplus s\). Eventually we obtain a masked \(x'=x \oplus s\) as required, in time \(\mathcal{O}(\log k)\) instead of \(\mathcal{O}(k)\).

4.1 Secure Computation of AND

Since Algorithm 1 contains AND operations, we first show how to secure the AND operation against first-order attacks. The technique is essentially the same as in [ISW03]. With \(x=x' \oplus s\) and \(y=y' \oplus t\) for two independent random masks s and t, we have for any u:

$$ (x \wedge y) \oplus u=\left( (x' \oplus s) \wedge (y' \oplus t) \right) \oplus u=(x' \wedge y') \oplus (x' \wedge t) \oplus (s \wedge y') \oplus (s \wedge t) \oplus u $$
[Algorithm: SecAnd]
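A C sketch of the SecAnd computation for \(k=32\); the masks s, t, u are assumed fresh, uniform and independent, as required by Lemma 2.

```c
#include <stdint.h>

/* SecAnd sketch: inputs x' = x ^ s, y' = y ^ t; output z' = (x & y) ^ u.
 * 8 Boolean operations; starting from u, every partial value of z stays
 * masked, and each product of shares has a distribution independent of
 * x and y (Lemma 2). */
uint32_t sec_and(uint32_t xp, uint32_t yp,
                 uint32_t s, uint32_t t, uint32_t u) {
    uint32_t z = u;
    z ^= xp & yp;
    z ^= xp & t;
    z ^= s & yp;
    z ^= s & t;   /* (x'&y') ^ (x'&t) ^ (s&y') ^ (s&t) = x & y */
    return z;
}
```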

We see that the SecAnd algorithm requires 8 Boolean operations. The following Lemma shows that the SecAnd algorithm is secure against first-order attacks.

Lemma 2

When s, t and u are uniformly and independently distributed in \({\mathbb F}_{2^k}\), all intermediate variables in the SecAnd algorithm have a distribution independent from x and y.

Proof

Since s and t are uniformly and independently distributed in \({\mathbb F}_{2^k}\), the variables \(x'=x \oplus s\) and \(y'=y \oplus t\) are also uniformly and independently distributed in \({\mathbb F}_{2^k}\). Therefore the distribution of \(x' \wedge y'\) is independent from x and y. The same holds for the variables \(x' \wedge t\), \(s \wedge y'\) and \(s \wedge t\). Moreover since u is uniformly distributed in \({\mathbb F}_{2^k}\), the distribution of \(z'\) from Line 1 to Line 4 is uniform in \({\mathbb F}_{2^k}\); hence its distribution is also independent from x and y. \(\square \)

4.2 Secure Computation of XOR

Similarly we show how to secure the XOR computation of Algorithm 1. With \(x=x' \oplus s\) and \(y=y' \oplus u\) where s and u are two independent masks, we have:

$$ (x \oplus y) \oplus s=x' \oplus s \oplus y' \oplus u \oplus s=x' \oplus y' \oplus u$$
[Algorithm: SecXor]
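A C sketch of the SecXor computation for \(k=32\), under the same assumptions on the masks:

```c
#include <stdint.h>

/* SecXor sketch: inputs x' = x ^ s, y' = y ^ u; output (x ^ y) ^ s.
 * 2 Boolean operations; the mask u cancels, leaving the result masked
 * by s, and every intermediate stays uniformly masked. */
uint32_t sec_xor(uint32_t xp, uint32_t yp, uint32_t u) {
    xp ^= yp;   /* (x ^ y) ^ s ^ u */
    xp ^= u;    /* (x ^ y) ^ s */
    return xp;
}
```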

We see that the SecXor algorithm requires 2 Boolean operations. The following Lemma shows that the SecXor algorithm is secure against first-order attacks. It is easy to see that all the intermediate variables in the algorithm are uniformly distributed in \({\mathbb F}_{2^k}\), and hence the proof is straightforward.

Lemma 3

When s and u are uniformly and independently distributed in \({\mathbb F}_{2^k}\), all intermediate variables in the SecXor algorithm have a distribution independent from x and y.

4.3 Secure Computation of Shift

Finally we show how to secure the Shift operation in Algorithm 1 against first-order attacks. With \(x=x' \oplus s\), we have for any t:

$$ (x \ll j) \oplus t=\left( (x' \oplus s) \ll j \right) \oplus t=(x' \ll j) \oplus (s \ll j) \oplus t $$

This gives the following algorithm.

[Algorithm: SecShift]
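A C sketch of the SecShift computation for \(k=32\):

```c
#include <stdint.h>

/* SecShift sketch: input x' = x ^ s; output (x << j) ^ t.
 * 4 Boolean operations; x is never unmasked since the shifted mask
 * (s << j) is removed only after the fresh mask t has been applied. */
uint32_t sec_shift(uint32_t xp, uint32_t s, uint32_t t, int j) {
    uint32_t y = (xp << j) ^ t;   /* (x << j) ^ (s << j) ^ t */
    y ^= (s << j);                /* (x << j) ^ t */
    return y;
}
```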

We see that the SecShift algorithm requires 4 Boolean operations. The following Lemma shows that the SecShift algorithm is secure against first-order attacks. The proof is straightforward so we omit it.

Lemma 4

When s and t are uniformly and independently distributed in \({\mathbb F}_{2^k}\), all intermediate variables in the SecShift algorithm have a distribution independent from x.

4.4 Our New Conversion Algorithm

Finally we can convert Algorithm 1 into a first-order secure algorithm by protecting all intermediate variables with a random mask; see Algorithm 5 below.

Since the \(\mathsf{SecAnd}\) subroutine requires 8 operations, the \(\mathsf{SecXor}\) subroutine requires 2 operations, and the SecShift subroutine requires 4 operations, lines 7 to 11 require \(2 \cdot 8+2 \cdot 4+2+2=28\) operations, hence \(28 \cdot (n-1)\) operations for the main loop. The total number of operations is then \(7+28 \cdot (n-1)+4+8+2+4=28 \cdot n-3\). In summary, for a register size \(k=2^n\) the number of operations is \(28 \cdot \log _2 k-3\), in addition to the generation of 3 random numbers. Note that the same random numbers s, t and u can actually be used for all executions of the conversion algorithm in a given execution. The following Lemma proves the security of our new conversion algorithm against first-order attacks.

Lemma 5

When r is uniformly distributed in \({\mathbb F}_{2^k}\), any intermediate variable in Algorithm 5 has a distribution independent from \(x=A+r {\text { mod }}2^k\).

Proof

The proof is based on the previous lemma for SecAnd, SecXor and SecShift, and also the fact that all intermediate variables from Line 2 to 5 and in lines 12, 13, 18, and 19 have a distribution independent from x. Namely \((A \oplus t) \wedge r\) and \(t \wedge r\) have a distribution independent from x, and the other intermediate variables have the uniform distribution. \(\square \)

[Algorithm 5: Kogge-Stone Arithmetic-to-Boolean Conversion]
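Since Algorithm 5 is given as a figure in the original, the following C sketch re-assembles the masked conversion from the Sec* subroutines for \(k=32\). The mask scheduling here (P masked by s, G by u, with an explicit re-masking of G by t before each XOR so that no XOR ever cancels two identical masks) is illustrative and does not reproduce Algorithm 5 line by line.

```c
#include <stdint.h>

static uint32_t sec_and(uint32_t xp, uint32_t yp,
                        uint32_t s, uint32_t t, uint32_t u) {
    uint32_t z = u;
    z ^= xp & yp;  z ^= xp & t;  z ^= s & yp;  z ^= s & t;
    return z;                       /* (x & y) ^ u */
}

static uint32_t sec_shift(uint32_t xp, uint32_t s, uint32_t t, int j) {
    uint32_t y = (xp << j) ^ t;
    return y ^ (s << j);            /* (x << j) ^ t */
}

/* Given arithmetic shares A, r of x = A + r (mod 2^32) and fresh uniform
 * masks s, t, u, return x' with x = x' ^ r.  Invariants in the loop:
 * Pp = P_i ^ s and Gp = G_i ^ u. */
uint32_t arith_to_bool(uint32_t A, uint32_t r,
                       uint32_t s, uint32_t t, uint32_t u) {
    uint32_t Pp = (A ^ s) ^ r;              /* P' = (A ^ r) ^ s */
    uint32_t Gp = u ^ ((A ^ t) & r);        /* never compute A & r in clear */
    Gp ^= t & r;                            /* G' = (A & r) ^ u */
    for (int i = 1; i <= 4; i++) {          /* iterations 1 .. n-1 */
        int pow = 1 << (i - 1);
        uint32_t H = sec_shift(Gp, u, t, pow);  /* (G << pow) ^ t */
        uint32_t W = sec_and(Pp, H, s, t, u);   /* (P & (G << pow)) ^ u */
        Gp = (Gp ^ t) ^ u;                      /* re-mask: G ^ t */
        Gp = (W ^ Gp) ^ t;                      /* G' = G_i ^ u */
        H = sec_shift(Pp, s, t, pow);           /* (P << pow) ^ t */
        Pp = sec_and(Pp, H, s, t, s);           /* P' = P_i ^ s */
    }
    uint32_t H = sec_shift(Gp, u, t, 16);       /* last step: only G needed */
    uint32_t W = sec_and(Pp, H, s, t, u);
    Gp = (Gp ^ t) ^ u;
    Gp = (W ^ Gp) ^ t;                          /* G' = G_n ^ u */
    uint32_t xp = A ^ (Gp << 1);                /* x' ^ (u << 1) */
    xp ^= (u << 1);                             /* x' = A ^ 2 G_n */
    return xp;                                  /* x = x' ^ r */
}
```

The final value satisfies \(x' \oplus r = A \oplus r \oplus 2G_n = A+r\) by Theorem 3.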

5 Addition Without Conversion

Baek and Noh proposed a method to mask the ripple-carry adder in [BN05]. Similarly, Karroumi et al. [KRJ14] used Goubin’s recursion formula (1) to compute an arithmetic addition \(z=x+y {\text { mod }}2^k\) directly with masked shares \(x' =x \oplus s\) and \(y'=y \oplus r\), that is without first converting x and y from Boolean to arithmetic masking, then performing the addition with arithmetic masks, and then converting back from arithmetic to Boolean masks. They showed that this can lead to better performance in practice for the block cipher XTEA.

In this section we describe an analogous algorithm for performing addition directly on the masked shares, based on the Kogge-Stone adder instead of Goubin’s formula, to get \(\mathcal{O}(\log k)\) complexity instead of \(\mathcal{O}(k)\). More precisely, we receive as input the shares \(x'\), \(y'\) such that \(x'=x \oplus s\) and \(y'=y \oplus r\), and the goal is to compute \(z'\) such that \(z'=(x+y) \oplus r\). For this it suffices to perform the addition \(z=x+y {\text { mod }}2^k\) as in Algorithm 1, but with the masked variables \(x'=x \oplus s\) and \(y'=y \oplus r\) instead of x, y, while protecting all intermediate variables with a Boolean mask; this is straightforward since Algorithm 1 contains only Boolean operations; see Algorithm 6 below.

[Algorithm 6: Kogge-Stone Masked Addition]
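In the same illustrative spirit (Algorithm 6 is a figure in the original, and the exact line ordering is not reproduced), a C sketch of the masked addition for \(k=32\): given Boolean shares \(x'=x \oplus s\) and \(y'=y \oplus r\) plus fresh auxiliary masks t and u, it returns \(z'=(x+y) \oplus r\) without ever unmasking x, y or \(x+y\).

```c
#include <stdint.h>

static uint32_t sec_and(uint32_t xp, uint32_t yp,
                        uint32_t s, uint32_t t, uint32_t u) {
    uint32_t z = u;
    z ^= xp & yp;  z ^= xp & t;  z ^= s & yp;  z ^= s & t;
    return z;                       /* (x & y) ^ u */
}

static uint32_t sec_shift(uint32_t xp, uint32_t s, uint32_t t, int j) {
    return ((xp << j) ^ t) ^ (s << j);  /* (x << j) ^ t */
}

uint32_t sec_add(uint32_t xp, uint32_t yp,
                 uint32_t s, uint32_t r, uint32_t t, uint32_t u) {
    uint32_t Pp  = (xp ^ yp) ^ r;           /* P' = (x ^ y) ^ s */
    uint32_t P0p = Pp;                      /* keep (x ^ y) ^ s for the end */
    uint32_t Gp  = sec_and(xp, yp, s, r, u);/* G' = (x & y) ^ u */
    for (int i = 1; i <= 4; i++) {          /* iterations 1 .. n-1 */
        int pow = 1 << (i - 1);
        uint32_t H = sec_shift(Gp, u, t, pow);
        uint32_t W = sec_and(Pp, H, s, t, u);
        Gp = (Gp ^ t) ^ u;                  /* re-mask: G ^ t */
        Gp = (W ^ Gp) ^ t;                  /* G' = G_i ^ u */
        H = sec_shift(Pp, s, t, pow);
        Pp = sec_and(Pp, H, s, t, s);       /* P' = P_i ^ s */
    }
    uint32_t H = sec_shift(Gp, u, t, 16);   /* last step: only G needed */
    uint32_t W = sec_and(Pp, H, s, t, u);
    Gp = (Gp ^ t) ^ u;
    Gp = (W ^ Gp) ^ t;                      /* G' = G_n ^ u */
    uint32_t zp = P0p ^ (Gp << 1);          /* z ^ s ^ (u << 1) */
    zp ^= (u << 1);                         /* z ^ s */
    zp ^= r;                                /* z ^ s ^ r : never z in clear */
    zp ^= s;                                /* z' = (x + y) ^ r */
    return zp;
}
```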

As previously, the main loop requires \(28 \cdot (n-1)\) operations. The total number of operations is then \(12+28 \cdot (n-1)+20=28 \cdot n+4\). In summary, for a register size \(k=2^n\) the number of operations is \(28 \cdot \log _2 k+4\), together with the generation of 2 random numbers; as previously, those 2 random numbers along with r and s can be reused for subsequent additions within the same execution. The following Lemma proves the security of Algorithm 6 against first-order attacks. The proof is similar to the proof of Lemma 5 and is therefore omitted.

Lemma 6

For uniformly and independently distributed randoms \(r \in \{0,1\}^k\) and \(s \in \{0,1\}^k\), any intermediate variable in the Kogge-Stone Masked Addition has the uniform distribution.

6 Analysis and Implementation

6.1 Comparison with Existing Algorithms

We compare in Table 1 the complexity of our new algorithms with Goubin’s algorithms and Debraize’s algorithm for various addition bit sizes k. We give the number of random numbers required for each algorithm as well as the number of elementary operations. Goubin’s original conversion algorithm from arithmetic to Boolean masking requires \(5k+5\) operations and a single random generation. This was recently improved by Karroumi et al. down to \(5k+1\) operations [KRJ14]. The authors also provided an algorithm to compute a first-order secure addition on Boolean shares using Goubin’s recursion formula, requiring \(5k+8\) operations and a single random generation. See Appendix A for more details. On the other hand, Debraize’s algorithm requires \(19(k/\ell )-2\) operations with a lookup table of size \(2^\ell \) and the generation of two randoms.

Table 1. Number of randoms (rand) and elementary operations required for Goubin’s algorithms, Debraize’s algorithm and our new algorithms for various values of k.

We see that our algorithms outperform Goubin's algorithms for \(k \ge 32\) but are slower than Debraize's algorithm with \(\ell =8\) (without taking into account its pre-computation phase). In practice, most cryptographic constructions performing arithmetic operations use addition modulo \(2^{32}\), for example HMAC-SHA-1 [NIS95] and XTEA [NW97]. There also exist cryptographic constructions with additions modulo \(2^{64}\), for example Threefish, used in the hash function Skein (a SHA-3 finalist), and the SPECK block cipher (see Sect. 6.3).

6.2 Practical Implementation

We have implemented our new algorithms along with Goubin's algorithms; we have also implemented the table-based arithmetic-to-Boolean conversion algorithm described by Debraize in [Deb12]. For Debraize's algorithm, we considered two word lengths for partitioning the data, \(\ell =4\) and \(\ell =8\). Our implementations were done on a 32-bit AVR microcontroller (AT32UC3A0512) based on a RISC architecture; it runs at frequencies up to 66 MHz and has 64 KB of SRAM along with 512 KB of flash. We used the C programming language, and the machine code was produced using the AVR-GCC compiler with further optimizations (e.g. loop unrolling). For the generation of random numbers we used a pseudorandom number generator based on linear feedback shift registers.
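As an illustration of the kind of generator used for mask generation, a minimal 32-bit Galois LFSR step in C might look as follows. The taps are one commonly cited choice (polynomial \(x^{32}+x^{22}+x^{2}+x+1\)); this is our own sketch, not the generator actually used in the experiments, and an LFSR alone is of course not a cryptographically strong source.

```c
#include <stdint.h>

/* One step of a 32-bit Galois LFSR. Feedback mask 0x80200003
 * corresponds to the polynomial x^32 + x^22 + x^2 + x + 1
 * (a common choice; verify the period before relying on it).
 * The state must be initialized to a nonzero seed. */
uint32_t lfsr32_next(uint32_t state)
{
    uint32_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0x80200003u;
    return state;
}
```

Each call produces the next state; successive states (or selected bits thereof) can serve as the fresh masks r, s, t, u consumed by the masked operations.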

The results are summarized in Table 2. Our new algorithms perform better than Goubin's algorithms from \(k=32\) onward: for \(k=32\), roughly \(14\,\%\) better than Goubin's algorithms, and our conversion algorithm performs \(7\,\%\) better than Debraize's algorithm with \(\ell =4\). For \(k=64\) the improvement is even larger: \(23\,\%\) faster than Goubin's algorithm and \(22\,\%\) better than Debraize's algorithm with \(\ell =4\). Debraize's algorithm with \(\ell =8\) remains faster than ours; however, as opposed to Debraize's algorithm, our conversion algorithm requires neither preprocessing nor extra memory.

Table 2. Number of clock cycles on a 32-bit processor required for Goubin’s conversion algorithm, Debraize’s conversion algorithm, our new conversion algorithm, Goubin’s addition from [KRJ14], and our new addition, for various arithmetic sizes k. The last two columns denote the precomputation time and the table size (in bytes) required for Debraize’s algorithm.

6.3 Application to HMAC-SHA-1 and SPECK

We have implemented HMAC-SHA-1 [NIS95] protected against first-order attacks using the techniques above, on the same microcontroller as in Sect. 6.2. To convert from arithmetic to Boolean masking, we used one of the following: Goubin's algorithm, Debraize's algorithm, or our new algorithm. The results for computing HMAC-SHA-1 on a single message block are summarized in Table 3; for Debraize's algorithm, the timings include the precomputation time required for creating the tables. Our algorithms give better performance than Goubin's and than Debraize's with \(\ell =4\), but Debraize's with \(\ell =8\) is still slightly better; however, as opposed to Debraize's algorithm, ours require no extra memory. When the masked addition is used instead of conversions, the new algorithm performs \(10\,\%\) better than Goubin's algorithm.

Table 3. Running time in thousands of clock-cycles and penalty factor for HMAC-SHA-1 on a 32-bit processor. The last column denotes the table size (in bytes) required for Debraize’s algorithm.

SPECK is a family of lightweight block ciphers proposed by the NSA, designed for high throughput in software [BSS+13]. The SPECK family comprises ciphers based on an ARX (Addition, Rotation, XOR) design with various block and key sizes. To verify the performance of our algorithms for \(k=64\), we used SPECK 128/128, where the block and key sizes are both 128 bits and additions are performed modulo \(2^{64}\). We summarize the performance of all the algorithms in Table 4.
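To show where the masked adder plugs in, here is a minimal sketch of one SPECK-128 round in C (rotation amounts \(\alpha=8\), \(\beta=3\) for 64-bit words, per the SPECK design [BSS+13]); the helper names are ours. The single addition modulo \(2^{64}\) in the round function is the operation that must be replaced by a masked addition (or bracketed by mask conversions) in a protected implementation.

```c
#include <stdint.h>

static uint64_t ror64(uint64_t x, unsigned r)
{
    return (x >> r) | (x << (64 - r));
}

static uint64_t rol64(uint64_t x, unsigned r)
{
    return (x << r) | (x >> (64 - r));
}

/* One SPECK-128 encryption round on the state words (x, y) with
 * round key k. The '+' below is the addition mod 2^64 that a
 * first-order protected implementation must perform on shares. */
void speck128_round(uint64_t *x, uint64_t *y, uint64_t k)
{
    *x = (ror64(*x, 8) + *y) ^ k;   /* addition modulo 2^64 */
    *y = rol64(*y, 3) ^ *x;
}

/* Inverse round, undoing the operations in reverse order. */
void speck128_unround(uint64_t *x, uint64_t *y, uint64_t k)
{
    *y = ror64(*y ^ *x, 3);
    *x = rol64((*x ^ k) - *y, 8);   /* subtraction modulo 2^64 */
}
```

With 32-bit registers, each 64-bit masked addition is in turn built from the k = 64 variants of the algorithms compared in Table 4.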

Table 4. Running time in clock-cycles and penalty factor for SPECK on a 32-bit processor. The last column denotes the table size (in bytes) required for Debraize’s algorithm.

As with HMAC-SHA-1, our algorithms outperform Goubin's algorithm and Debraize's algorithm with \(\ell =4\), but not Debraize's algorithm with \(\ell =8\).