Keywords

1 Introduction

Side-channel analysis (SCA) poses a serious threat to any cryptographic implementation. If no dedicated countermeasure is applied, the secret of the underlying device can be easily extracted by SCA. A popular approach to increase the security of a cryptographic implementation is the use of masking. It is achieved by blinding the processed values by means of random masks [21] so that it should become impossible for an attacker to predict intermediate values.

To date there exist several types of masking schemes that differ in the level of abstraction and the target operation. In this work we focus on the techniques developed to be applied at algorithmic level, e.g., Boolean and arithmetic masking, which need to be adjusted according to the underlying cryptographic algorithm [21]. Note that nearly all proposed ciphers employ both logical and arithmetic operations.

As an example, ARX-based designs consist of three operations: integer addition, rotation, and XOR. Such constructions are the foundation for block ciphers (like FEAL [23] or Threefish [13]), stream ciphers (Salsa20 [5], ChaCha [4], HC-128 [35]) and hash functions (BLAKE [2], Skein [13]). There are further examples that also include a mixture of Boolean and arithmetic operations like the TEA family of block ciphers [34] and SHA-2 [30]. To realize a masked implementation of these constructions, one option is to employ both Boolean and arithmetic masking schemes. Rotation and XOR operations can be protected by Boolean masking, while arithmetic masking is advantageous for the addition operations. However, the required conversions between both operations can also be the target of an SCA attacker and hence need to be implemented in a secure way. In particular, many existing results discussing this method identified the conversion between arithmetic and Boolean masking as a major hurdle [3, 9, 14].

Related Works.

We now briefly highlight several works on the conversion between Boolean and arithmetic masking. The conversion techniques can be categorized into those which use precomputation [12] and those without precomputation [17]; however most of them were designed specifically for software platforms. Unfortunately, these constructions cannot be easily mapped to a dedicated hardware module without violating their claims on security. Roughly speaking, this is mainly due to critical glitches that occur inside masked circuits [22]. To avoid this problem, every step would need to be separated by a register stage which would be detrimental to the performance.

We like to remark that a hardware design for such conversions has been proposed in [15], but since both the mask and masked data are involved in the processes of the proposed techniques, such constructions are expected to still have first-order leakages (see [25]). Another problem is the transformation of conversion algorithms to higher orders. It has been shown in [10] how to secure the conversions against higher-order attacks, but this feature comes with a prohibitive overhead for any cryptographic implementation.

Along the same lines, in order to avoid the conversions a technique to securely perform modular arithmetic addition on Boolean masked operands has been introduced in [18]. However, this scheme has been developed to be used in software applications and cannot be easily applied on a hardware platform where performance is a key factor.

Recently, an approach was developed in [11] which uses the Kogge-Stone adder as a basis. But the conversion and masked addition requires more random bits compared to the solutions from [17, 18] and are only faster for larger bit sizes (i.e., 64 bits). Still their focus lies on software applications which makes them inefficient in hardware.

Contributions.

The target of this work is to design efficient hardware modules for modular addition of Boolean masked operands. More precisely, our goal is to develop a similar technique such as [18] for a hardware platform.

Since masked hardware designs face severe challenges due to glitches, we apply the concept of threshold implementations (TI) [29] that can satisfy the security requirements even in the presence of glitches. TI combines the ideas of Boolean secret sharing and multiparty computation. It has previously been applied to realize the secure hardware design of symmetric ciphers [6, 26, 31]. Although it has initially been developed with respect to first-order security, its extension to higher orders has been recently introduced [7].

In this paper we consider two factors to design the aforementioned module: (a) throughput and (b) SCA security order. With respect to performance (i.e., throughput) we consider two designs to implement a 32-bit arithmetic adder that is required by many cryptographic algorithms:

  1. 1.

    Ripple-Carry Adder (RCA) that requires 32 clock cycles to perform a complete addition, and

  2. 2.

    Kogge-Stone Adder (KSA) with 6 clock cycles latency and a fully pipelined architecture.

We present the first-order and (univariate) second-order secure threshold implementation of the two above mentioned designs. We show that our designs not only outperform the inefficient approaches of [10] but also reduce the number of fresh random mask bits required for each addition. We also present practical SCA evaluations performed on a Spartan-6 FPGA to confirm the claimed security levels. To the best of our knowledge, our four proposed architectures are the only available hardware-dedicated solutions that are supported by security proofs as well as by practical investigations.

2 Background

In this section, we introduce the used notations and present the basic ideas behind our designs.

2.1 Notations

In the following all equations are bit-level operations. An n-bit integer operand a is represented as \((a_{n-1}a_{n-2}\ldots a_1a_0)\) where \(a_0\) is the least significant bit. These integers are split up into shares of which the j-th share of a bit \(a_{i\in \{0,\ldots ,n-1\}}\) is denoted by \(a_i^j\). Inside the equations two Boolean operators are used: \(\oplus \) denotes the logical XOR and \(\wedge \) the logical AND. AND operators are always evaluated before any XOR operators.

2.2 Ripple-Carry Adder

In [18] the authors presented a way to securely add two Boolean masked values. Instead of three conventional steps (conversion, addition, reconversion), the addition can be implemented in just one step. Depending on the application, this can signficantly increase the performance over the classical approach. The algorithm introduced in [18] is based on a ripple-carry adder (RCA). This adder has been rewritten into a sequence of Boolean operations that take the Boolean masks into consideration. The algorithm is word-oriented for efficiency in software but not in hardware.

Similarly, our design is based on the basic algorithm described in [18]. The underlying algorithm builds on the fact that one bit of sum s can be computed as

$$\begin{aligned} s_i=a_i\oplus b_i\oplus c_i . \end{aligned}$$
(1)

Therefore, the addition is replaced by a simple XOR of the two operands a and b and the carry c. The only unknown part in such an equation is the carry bit which can be computed using a recursive formula

$$\begin{aligned} c_{i+1} = a_i\wedge b_i \oplus a_i\wedge c_i \oplus b_i\wedge c_i, \end{aligned}$$
(2)

where \(c_0=0\). The costly part of the RCA is the recursive carry computation. Its function has to be evaluated iteratively which leads to a high circuit depth in case of a fully combinatorial design.

2.3 Kogge-Stone Adder

Another addition circuit with a lower depth is given by the Kogge-Stone adder (KSA) [19] that splits the carry generation into generate (g) and propagate (p) functions. Instead of evaluating the carry function recursively, the KSA benefits from a tree-like structure and achieves a logarithmic complexity. For a hardware design, a KSA can significantly increase the overall performance.

Fig. 1.
figure 1

The structure of the carry generation for 4-bit operands using the KSA

The basic structure of KSA for \(n=4\)-bit operands is shown in Fig. 1. For operands a and b it computes the carry bits in three steps. During preprocessing the initial \(g_i\) and \(p_i\) values are generated as

$$\begin{aligned} g_i = a_i\wedge b_i, \quad p_i = a_i\oplus b_i. \end{aligned}$$
(3)

In the following stages a function is used to combine the g and p values of different bit positions. This function receives 4 bits as input and returns 2 output bits. For \(i>j\) the output values are computed as

$$\begin{aligned} g_{i:j} = g_i\oplus g_j\wedge p_i, \quad p_{i:j} = p_i\wedge p_j . \end{aligned}$$
(4)

After \(\log _2(n=4)=2\) stages, the computation is finished and all carry bits can be derived as

$$ c_{i\in \{2 \ldots n\}} = g_{i-1:0},\quad c_1 = g_0, \quad c_0 = 0. $$

Finally the sum s can be obtained according to Eq. (1).

2.4 Threshold Implementations

In order to realize secure masked implementations and avoid the leakage caused by glitches (e.g., [22, 25]), the threshold implementation (TI) scheme has been introduced and developed in [7, 2729]. Based on the algebraic degree t of the targeted non-linear function (Sbox) as well as the desired orderFootnote 1 of security d, the minimum number of input shares \(s_{in}\) and the minimum number of output shares \(s_{out}\) are defined as

$$ s_{in} = t \times d +1, \quad s_{out} = \left( {\begin{array}{c}s_{in}\\ t\end{array}}\right) . $$

The input x of the e.g., Sbox is represented by \((x^1,\ldots ,x^{s_{in}})\) in such a way that \(x=\bigoplus \limits _{i=1}^{s_{in}} x^i\). The output of the TI of the corresponding Sbox \((y^1,\ldots ,y^{s_{out}})\) should be also a shared representation of \(y=S(x)=\bigoplus \limits _{j=1}^{s_{out}} y^j\) while each \(y^j\) is provided by a component function \(f^j(\ldots )\) over a subset of input shares \((x^1,\ldots ,x^{s_{in}})\). This property is known as correctness while non-completeness is referred to the fact that any d (security order) selection of component functions \(f^1(\ldots ),\ldots ,f^{s_{out}}(\ldots )\) is independent of at least one input share. These two properties are relatively easy to achieve, but the third property uniformity is challenging. As the security of masking schemes is based on the uniform distribution of the masks, the output of a TI Sbox must be uniform as it is used as input in further parts of the implementation.

Suppose that for a certain input \(\mathrm {x}\) all possible sharings are given to the TI Sbox. The tuple \(\big (f^1(\ldots )\), ..., \(f^{s_{out}}(\ldots )\big )\) should be drawn uniformly from the set as all possible sharings of \(\mathrm {y}=S(\mathrm {x})\).

An important point is that the output of the component functions must be stored in dedicated registers to avoid the propagation of glitches. Another issue is related to the uniformity of the TI functions of security order \(d\,>\,1\). In such a case, the number of output shares \(s_{out}\) is usually higher than the number of input shares \(s_{in}\); hence uniformity cannot be achieved. Therefore, some of the registered output shares should be combined to reduce the number of output shares to \(s_{in}\) at most. After such a combination the uniformity can be examined. For more detailed information, the interested reader is referred to the original works [7, 29].

3 Implementation

We present two designs of a modulo \(2^{32}\) adder that provides resistance against first- and second-order SCA. This is a quite common type of addition used in many cryptographic algorithms (e.g., Salsa20, HC-128, SHA-2), but our architectures can be also easily adapted to other bit lengths.

3.1 Ripple-Carry Adder (First-Order SCA-Resistant)

Based on the scheme presented in Sect. 2.2 we build a first-order SCA-resistant adder. To achieve this, Eqs. (1) and (2) should be transformed to meet the three required TI properties.

Given that Eq. (2) is of degree 2, at least 3 shares (for input as well as for output) are necessary. It is supposed that each processed value, e.g., \(a_i\), is split into 3 shares as \((a^1_i,a^2_i,a^3_i)\). In case of Eq. (1), due to its linearity the shares are easily combined via XOR as

$$\begin{aligned} s_i^1 = a_i^1\oplus b_i^1\oplus c_i^1,\quad s_i^2 = a_i^2\oplus b_i^2\oplus c_i^2,\quad s_i^3 = a_i^3\oplus b_i^3\oplus c_i^3. \end{aligned}$$
(5)

As mentioned before, Eq. (2) is non-linear and has algebraic degree of 2. Following direct sharing approach represented in [8], we can construct a correct and uniform shared implementation of such a function. The shares of the carry bit can be computed as

$$\begin{aligned} c_{i+1}^1 = a_i^2\wedge b_i^2 \oplus a_i^2\wedge b_i^3 \oplus a_i^3\wedge b_i^2 \oplus a_i^2\wedge c_i^2 \oplus a_i^2\wedge c_i^3 \oplus a_i^3\wedge c_i^2 \oplus b_i^2\wedge c_i^2 \oplus b_i^2\wedge c_i^3 \oplus b_i^3\wedge c_i^2 \end{aligned}$$
(6)
$$\begin{aligned} c_{i+1}^2 = a_i^3\wedge b_i^3 \oplus a_i^3\wedge b_i^1 \oplus a_i^1\wedge b_i^3 \oplus a_i^3\wedge c_i^3 \oplus a_i^3\wedge c_i^1 \oplus a_i^1\wedge c_i^3 \oplus b_i^3\wedge c_i^3 \oplus b_i^3\wedge c_i^1 \oplus b_i^1\wedge c_i^3 \end{aligned}$$
(7)
$$\begin{aligned} c_{i+1}^3 = a_i^1\wedge b_i^1 \oplus a_i^1\wedge b_i^2 \oplus a_i^2\wedge b_i^1 \oplus a_i^1\wedge c_i^1 \oplus a_i^1\wedge c_i^2 \oplus a_i^2\wedge c_i^1 \oplus b_i^1\wedge c_i^1 \oplus b_i^1\wedge c_i^2 \oplus b_i^2\wedge c_i^1 \end{aligned}$$
(8)

Here we should note that Eqs. (1) and (2) can be seen as a function \(f:(a_i,b_i,c_i)\mapsto (s_i,c_{i+1})\). At a first glance one may think of examining the uniformity of the \((s_i,c_{i+1})\) tupleFootnote 2. However, such a tuple is never supplied to any function within the RCA algorithm. Note that \(s_i\) is an output bit and is not propagated while \(c_{i+1}\) is given to the next stage where it is combined with \(a_{i+1}\) and \(b_{i+1}\) which are independent of \(c_{i+1}\). Hence the uniformity of \(c_{i+1}\) suffices to fulfill the corresponding property.

During the implementation of such a design we encountered an issue that has never been reported before. The output of the shared carry computation function (Eqs. (6) to (8)) cannot be directly used as feedback signal since the output of a function from a previous cycle is used as input in the next clock cycle.

Fig. 2.
figure 2

Structure of the first-order secure adder based on RCA

As a remedy we constructed a two-stage design as depicted in Fig. 2. The three shares of the two operands a and b are stored in shift registers. The RCA algorithm and the deployment of shift registers supports an efficient scanning of operand bits. Two instances of the shared carry computation function are implemented whose outputs are stored in carry registers \(\mathtt {c_0}\) and \(\mathtt {c_1}\). The carry registers are enabled alternately while the other intermediate registers (\(\mathtt {c'_0}\), \(\mathtt {a'_0}\), \(\mathtt {a'_1}\), \(\mathtt {b'_0}\) and \(\mathtt {b'_1}\)) are enabled every second clock cycle synchronized with that of \(\mathtt {c_1}\). The operand registers are also shifted two bits every other second clock cycle. The additional registers, i.e., \(\mathtt {c'_0}\), \(\mathtt {a'_0}\), \(\mathtt {a'_1}\), \(\mathtt {b'_0}\), and \(\mathtt {b'_1}\) synchronize the computation of the sum bits, which need to be performed one clock cycle after that of the carry bits. Note that we use the shift register of operand a to save the result of the addition.

Another issue is related to the first stage, i.e., when \(i=0\). In our designs we suppose that input carry \(c_0\,=\,0\) so that \((c^1_0,c^2_0,c^3_0)\) should be a shared representation of 0. Therefore, both carry registers have to be initialized with a random set representing 0. In other words, our design requires four fresh mask bits \(fm_1\), \(\ldots \), \(fm_4\) only at the start of the addition to initialize \(\mathtt {c_0}\) and \(\mathtt {c_1}\) with \((fm_1,fm_2,fm_1\oplus fm_2)\) and \((fm_3,fm_4,fm_3\oplus fm_4)\) respectively. Note that all other stages of our design do not require fresh random bits leading to an efficient design with respect to the number of required fresh mask bits. For instance, our design is considerably more efficient than the solutions proposed in [10, 11, 18].

3.2 Ripple-Carry Adder (Second-Order SCA-Resistant)

The design described above can be simply transformed to support resistance against higher-order attacks. We now present a solution for the second-order resistant design. We increase the number of input shares \(s_{in}\) to 5 and all corresponding functions have to be chosen according to the principles of higher-order TI [7].

Equation (5) just needs to be adapted to the increased number of shares. The computation of the carry has to be split up into two steps. In the first step, \(s_{out}=10\) component functions generate 10 output shares. Following the same concept presented in [7], these intermediate shares are then again reduced to 5 shares in the second step.

No major changes to the basic structure depicted in Fig. 2 are necessary for implementation. Just the registers have to be adjusted to the increased number of shares and the F block is split by an additional register stage. All uniform equations for the F function, obtained by direct sharing, are described in detail in the appendix of this work.

As a consequence, the amount of utilized resources increases and the number of clock cycles needed for the carry computation doubles. Just as before, the carry registers need fresh randomness during the initialization. Therefore, the number of required fresh random masks increases to 8 bits.

3.3 Kogge-Stone Adder (First-Order SCA-Resistant)

The design based on the RCA has low requirements for space and randomness. However, the number of clock cycles for one addition grows linearly with the bit length of the operands. For increased performance we therefore implemented a design that uses a KSA as foundation and which is still secure against first-order attacks.

Equations (3) and (4) need to be split into shares. Since all the corresponding formulas are of degree two, similar to that of the RCA, at least 3 shares are required to realize a functional TI.

The two outputs of the preprocessing step are both given to the next stages; thus, the uniformity of each tuple \((g_i,p_i)\) must be taken into account. One part of Eq. (3) needs to be implemented by the AND of the two operands for which no uniform TI with 3 shares exists [29]. For this, fresh mask bits have to be used to make it uniform (see remasking in [8, 26]). In our design, we adopted the solution from [8] with only a single virtual share. One fresh random bit \(m_i\) is required for every invocation of the function in the preprocessing step:

$$\begin{aligned} g_{i}^1 = a_i^2\wedge b_i^2 \oplus a_i^2\wedge b_i^3 \oplus a_i^3\wedge b_i^2 \oplus m_i \end{aligned}$$
(9)
$$\begin{aligned} g_{i}^2 = a_i^3\wedge b_i^3 \oplus a_i^1\wedge b_i^3 \oplus a_i^3\wedge b_i^1 \oplus a_i^1\wedge m_i \oplus b_i^1\wedge m_i \end{aligned}$$
(10)
$$\begin{aligned} g_{i}^3 = a_i^1\wedge b_i^1 \oplus a_i^1\wedge b_i^2 \oplus a_i^2\wedge b_i^1 \oplus a_i^1\wedge m_i \oplus b_i^1\wedge m_i \oplus m_i. \end{aligned}$$
(11)

Further, preprocessing involves another linear XOR-function. We implement this part similar to Eq. (5). Both functions as well as their joint output \((g_i,p_i)\) fulfill the three TI properties.

The preprocessing step is followed by stages in which the g and p values are updated according to Eq. (4). These two functions can be considered as a 4-bit to 2-bit mapping. Similar to the preprocessing step, the tuple of the 2-bit output has to be considered for the uniformity check. For the computation of the g part of Eq. (4) (as an AND/XOR operation), we followed the direct sharing approach [8] and achieved:

$$\begin{aligned} g_{i:j}^1&= g_i^2 \oplus g_j^2\wedge p_i^2 \oplus g_j^2\wedge p_i^3 \oplus g_j^3\wedge p_i^2 \end{aligned}$$
(12)
$$\begin{aligned} g_{i:j}^2&= g_i^3 \oplus g_j^3\wedge p_i^3 \oplus g_j^1\wedge p_i^3 \oplus g_j^3\wedge p_i^1\end{aligned}$$
(13)
$$\begin{aligned} g_{i:j}^3&= g_i^1 \oplus g_j^1\wedge p_i^1 \oplus g_j^1\wedge p_i^2 \oplus g_j^2\wedge p_i^1. \end{aligned}$$
(14)

The other part (computation of p) of Eq. (4) can be implemented similar to Eq. (9). To reduce the amount of required fresh random bits, we replaced \(m_i\) with \(g_j^1\). This bit is not used in this equation and can take the role of a mask. Although our construction does not closely follow the assumptions in [8] considering the construction of virtual shares, we can demonstrate that this has no impact on security. Our simulation results show not only the uniformity of shared \(p_{i:j}\) but also the uniformity of the shared tuple \((p_{i:j},g_{i:j})\). We need to emphasize that due to the specific architecture of the KSA algorithm, \(g_j^1\) is only used once as a mask to introduce uniformity into the computation of a p. In other words, the mask bit \(g_j^1\) is never reused again what could potentially violate the uniformity in later stages.

Our design is optimized for maximum throughput by using a fully pipelined architecture. Figure 3 depicts the basic structure of our design. Since only the preprocessing step requires fresh random bits and – as stated above – all other stages are computed without additional mask, the total number of required fresh mask bits is \(n=32\). Compared to the other solutions like [10, 11, 18] this is still reasonable.

3.4 Kogge-Stone Adder (Second-Order SCA-Resistant)

Similar as for the RCA, the design based on the KSA is also easily extensible to higher orders. We outline the exemplary procedure for second-order security. In this case, the number of input shares is again set to 5. The four aforementioned equations are adjusted to meet the requirements of the second-order TI.

The XOR part of Eq. (3) is implemented as before but adapted to the higher number of shares. The AND part to compute \(g_i\) of Eq. (3) is split into two steps. As before, the first step results in 10 output shares and the second step merges the last 6 shares into a total number of 5 shares again. Furthermore, we have to use fresh masks to assure the uniformity. In this case, four fresh random bits are necessary. The two functions of the following stages are also split into two steps. For the computation of \(g_{i:j}\) of Eq. (4) we use the second-order TI representation of the AND/XOR function given in [7]. For the \(p_{i:j}\) part (the AND operation) we use the same construction of \(g_i\) of the preprocessing step. Instead of four fresh mask bits, we used 4 shares of \(g_j\) as fresh masks to reduce the required randomness. Details of the underlying uniform equations can be found in the appendix.

Fig. 3.
figure 3

Block diagram of the first-order secure adder based on KSA

The basic structure as shown in Fig. 3 is also the template for the architecture of most other parts. It has mainly to be adapted to 5 shares and the functions need to be split into two steps with a register in between. Hence the number of clock cycles for one addition is doubled. In terms of randomness the demand of our implementation quadruples, because each invocation of the AND operation in the preprocessing step requires 4 random bits.

3.5 Comparison

We now compare our designs in terms of size and performance that are implemented on a Spartan-6 FPGA with other solutions. All our findings are summarized in Table 1. In terms of size, the RCA-based variant is clearly superior to other solutions due to the iterative structure. On the contrary, the designs based on the KSA provide low latency and high-performance applications.

Due to the different implementation platforms, we cannot fairly compare our hardware designs with software-based solutions [10, 11, 18]. Therefore, Table 1 restricts the comparison to the number of required fresh random bits.

We can conclude that the RCA-based design is most efficient regarding the number of required random bits. The requirement of \(4\cdot d\) random bits outperforms all other proposals and is also independent of the operands bit length n. The approach based on KSA requires a higher number of random bits which also depends on n. Nevertheless, the first-order secure design uses the same amount of fresh masks as the solution of [18] and less than [11]. For higher orders it even outperforms the design of [10]. It is noteworthy that the number of fresh masks for d-order KSA with \(d\ge 2\) can be decreased even further. For \(d=1\) we can use the trick presented in [8] that requires only one fresh mask bit for an AND operation. Such a construction – with one virtual variable – might be also found for higher-order TI of the AND operation thereby reducing the number of required fresh mask bits.

4 Analysis

For the practical SCA evaluations we employed a SAKURA-G platform [1] populated with a Spartan-6 FPGA as target. All SCA traces have been collected by a digital oscilloscope while measuring the voltage drop over a \(1\,\mathrm {\Omega }\) resistor in Vdd path. In order to obtain clean signals and reduce the electrical noise, we used the embedded amplifier of the SAKURA-G and restricted the bandwidth of the oscilloscope to 20 MHz.

Table 1. Results and comparison of our hardware architectures

As evaluation metric we applied a non-specific statistical t-test [16]. The feature of this test is to indicate the existence of any leakage at a defined order in the power traces. Following the concept of non-specific t-test, power traces corresponding to fixed and randomly selected inputs are collected. Hence this scheme is also denoted as fixed vs. random t-test. During the measurements the fixed and random inputs need to be randomly interleaved. Then, the traces T are categorized into two groups \(G_1\) and \(G_2\) with respect to the fixed and random inputs, respectively. In the following explanation, we consider only one point of the collected traces for the sake of simplicity:

Recall that Welch’s (two-tailed) t-test is computed as

$$\begin{aligned} t=\frac{\mu (T\in G_1)-\mu (T\in G_2)}{\sqrt{\frac{\delta ^2(T\in G_1)}{|G_1|}+\frac{\delta ^2(T\in G_2)}{|G_2|}}}, \end{aligned}$$

where \(\mu \) and \(\delta ^2\) denote the sample mean and sample variance respectively, and |.| represents the cardinality. The t-test indeed examines the validity of the null hypothesis as the samples in both groups were drawn from the same population. If the null hypothesis is correct, it can be concluded with a high level of confidence that the device under test does not have any first-order leakage, given the recorded traces.

For such a conclusion the Student’s t-distribution function (in addition to the degree of freedom) is applied to determine the probability of rejecting the aforementioned hypothesis (cf. [7, 16]). For typical evaluations, a threshold for |t| as \(> 4.5\) is defined to reject the null hypothesis and indicate that a first-order attack is feasible. This process is repeated at each sample point independently to obtain a curve of t value.

The aforementioned scheme can be easily extended to higher orders by preprocessing the traces as, for example, centered square (for the second-order), standardized cube (for the third order), etc. It is noteworthy that the same evaluation scheme has been applied in [7, 20] to investigate the existence of first- and higher-order leakages. For detailed information on how to conduct the tests at higher orders the interested reader is referred to [33].

4.1 Ripple-Carry Adder

Now we analyze the security of our first-order SCA-resistant RCA design. A sample trace of such a design is shown in Fig. 4(a). In order to have a reference for the existing leakage in our platform as well as an evidence for the suitability of the applied evaluation scheme, we first turned the PRNG off that provides the randomness for initial sharing and fresh masks. Hence all outputs of the PRNG are set to zero and the underlying design receives unshared inputs as (a, 0, 0) and (b, 0, 0). With such a setting we collected \(100\,000\) traces corresponding to a mixture of fixed and random inputs. Therefore, we expect the t-test to report clearly exploitable first-order leakages, which is confirmed by the corresponding result shown in Fig. 4(b). It can be seen that the t value exceeds 400 during the cryptographic operation exceeding the defined threshold by far.

Fig. 4.
figure 4

RCA 1st order, t-test results using \(100\,000\,000\) traces

As the next step we activated the PRNG so that the adder circuit receives randomly shared inputs and fresh random masks. Hence the design is expected to provide first-order security. In order to examine this we collected \(100\,000\,000\) traces and performed the t-test up to third order. The corresponding results shown in Fig. 4 indicate the resistance of the design to first-order attacks and – as expected – its vulnerability to second- and third-order attacks.

Fig. 5.
figure 5

RCA 2nd order, t-test results using \(300\,000\,000\) traces

Fig. 6.
figure 6

KSA 1st order, t-test results using \(100\,000\,000\) traces

We continued our evaluation with the second-order-SCA-resistant RCA design with an active PRNG. Due to the high amount of randomness, i.e., fourth-order Boolean masking (five shares), exploiting a leakage from such a design needs a large number of traces. Therefore, following the above-explained procedure we collected \(300\,000\,000\) traces and ran the t-test evaluations. The results shown in Fig. 5 confirm the resistance of our design to first- and second-order attacks. Similar to the results of [7], the third-order leakage still cannot be detected, but we observe fourth- and fifth-order leakages as the design with five shares is expected to be vulnerable to a fifth-order attack.

4.2 Kogge-Stone Adder

Both analyses on the first- and second-order RCA were repeated on the first- and second-order SCA-resistant KSA designs. We even collected the same number of traces, i.e., \(100\,000\,000\) traces to evaluate the 1st-order KSA and \(300\,000\,000\) traces for the second-order KSA. The results which confirm the resistance of our constructions are shown in Figs. 6 and 7, respectively.

Fig. 7.
figure 7

KSA 2nd order, t-test results using \(300\,000\,000\) traces

4.3 Higher-Order Security

Recently, Reparaz published a note [32] on the security of higher-order threshold implementations. It states that when different intermediates values, i.e., shares, from different clock cycles are combined, a second-order TI might be vulnerable to the corresponding second-order attack. Although confirming this statement in general, we like to emphasize that this is not addressed in [7]. The idea behind higher-order TI is to resist against univariate higher-order attacks where the leakage of different points (of different clock cycles) are not combined. Hence, in the model of univariate higher-order attacks, all lemmas and proofs as given in [7] remain valid. Furthermore, this is backed by our practical investigations as shown above. Still we need to highlight that the second-order TI designs we presented in this work are designed to resist against univariate second-order attacks.

In this context, it has been previously shown in [24] that multivariate leakages can be easily summed up and be represented in a univariate form. The suggested approaches for such a combination include (a) running the target device at a relatively high frequency, e.g., 24 MHz–48 MHz, and (b) making use of a DC blocker and/or certain amplifiers in the measurement setup. Both techniques cause overlapping the power peaks of adjacent clock cycles, and hence the leakage associated to consecutive clock cycles are somehow added together. In [24] it has been shown that employing any of the aforementioned techniques causes an implementation of a univariate second-order resistant design to be vulnerable to a univariate second-order attack.

Fig. 8.
figure 8

(modified measurement setup) KSA 2nd order, t-test results using \(300\,000\,000\) traces

In order to examine the effect of such an issue on our second-order TI designs, we considered the second aforementioned technique. In other words, we employed a DC blocker (BLK-89-S+ from Mini-Circuits) and two serially conected AC amplifiers (ZFL-1000LN+ from Mini-Circuits) in the measurement setup. By means of this setup we repeated the same measurements and evaluations of our developed second-order Kogge-Stone Adder using the same number of \(300\,000\,000\) traces. We kept the measurement settings, e.g., sampling rate, bandwidth, and the target frequency of operation, the same as the last experiments. The results shown in Fig. 8 indeed practically confirm the note given in [32]. The second-order TI design demonstrates second-order leakages when the power peak of consecutive clock cycles are combined (by the measurement setup). Interestingly, by such a measurement setup the 4th-order and 5th-order analyses (in contrary to the previous experiment of Fig. 7) do not show a detectable leakage. We believe that it is due to the noise introduced by the measurement setup, i.e., overlapping the adjacent power peaks, which can certainly affect the feasibility of higher-order attacks.

5 Conclusion

In this paper, we presented two ways of performing addition on Boolean masked values that are secure against SCA attacks on a hardware platform. Compared to the KSA-based approach, the RCA-based solution is slower but requires less space and the least amount of random bits. In terms of performance, the design based on the KSA provides a suitable choice due to its pipelined architecture. In comparison to other already published algorithms, our approaches are able to match and even reduce the randomness requirements especially for higher orders. The resistance of both approaches has been verified by practical evaluations showing the security of our constructions. Our proposed designs enable an efficient and secure implementation of ARX-based designs in hardware which have not been fully investigated yet.