1 Introduction

Adversaries aim at disclosing secret information contained in integrated systems, which are currently the main vector of data exchanges. One approach is Side Channel Analysis (SCA), which tries to reveal cryptographic keys by exploiting the information contained in one or several physical leakages of cryptographic devices, especially power consumption and electromagnetic emanations. In the seminal paper [1], the difference of means was used as a distinguisher to extract key-related information from the power consumption leakage. Since then, more efficient distinguishers have been considered, notably Pearson’s correlation coefficient [2], leading to the SCA referred to as CPA, and the Mutual Information (MI) index, which appears as a promising alternative because it is capable of capturing any type of association. Mutual Information Analysis (MIA) in SCA was introduced in [3, 4] and much work has been devoted to investigating its potential for attacking cryptographic implementations featuring various countermeasures and background noises in leakage traces [5, 6]. To summarize, MI was shown to be generally less efficient than Pearson’s coefficient when the leakage function is nearly linear, as is usually the case in unprotected devices [4, 7]. However, MIA appears promising when an adequate leakage profiling is a priori challenging [8, 9] or for attacking some protected devices [5, 6, 9, 10].

The main difficulty in implementing a MIA is that, in contrast to Pearson’s coefficient, which is easily estimated via sample moments, the estimation of the MI index requires the estimation of a number of probability distribution functions (PDF), and this task is, both theoretically and practically, a difficult statistical problem. Further, it has been stated [6, 11, 12] that the choice of a good PDF estimator is crucial for the efficiency of a MIA. Thus, a variety of parametric (cumulant [9] or copula [13]) and nonparametric estimators (histograms [4], splines [14] and kernels [3, 8]) have been explored. Among the nonparametric methods, Kernel Density Estimators (KDE) [15, 16] have emerged in the statistical literature as one of the most popular approaches, in view of their many appealing theoretical properties. However, KDE involves a major technical difficulty because it requires the choice of a crucial tuning parameter, referred to as the bandwidth (see Sect. 3). There exist formulas for choosing this bandwidth in some optimal fashion for the problem of estimating a PDF. Unfortunately, analogous formulas for the problem of estimating the MI index have not yet been developed. Thus, most MIA based on KDE (i.e. KDE-MIA) have taken the route of estimating the PDF using these formulas, the logic being that if all PDF are well estimated, plugging these estimates into the expression of the MI index should yield a good estimator. But these formulas, besides being based on an asymptotic argument (optimizing the trade-off between asymptotic bias and variance), are averages over the whole range of the PDF. Moreover, they involve unknown quantities that in turn must be estimated. In practical situations, there is no guarantee that such average estimated values will yield globally good PDF estimators, and it is often recommended that they be used only as starting points in the estimation process. Thus, applying them in an automatic fashion amounts to using an unsharpened tool. All this is further compounded by the fact that in computing the MI index, many different PDF need to be simultaneously estimated and integrated over their range. As stated by [12], this may help in explaining the often lower efficiency of a standard MIA, as compared to CPA.

In this paper, we develop a new approach that selects the bandwidth in KDE-MIA so as to optimize the quality of the attack with respect to two criteria, namely efficiency and genericity, instead of aiming at the quality of the PDF estimates. When applied to several data sets, the new MIA, referred to as ABS-MIA (ABS for Adaptive Bandwidth Selection), performs much better than the standard MIA and can even compete favorably with CPA.

The paper is organized as follows. Section 2 briefly recalls the modus operandi of SCA attacks and introduces the basics of MIA. Section 3 presents the KDE. Section 4 motivates and presents our proposal. This is then applied in Sect. 5 to some data. Section 6 concludes the paper and discusses some extensions.

2 Side Channel Analysis: An Overview

SCA is based on the fact that the physical leakage emanating from secure devices contains information on secret keys, and that an adversary can retrieve such keys by relating this information to the known course of the cryptographic device. In practice, this is done by relating the leakage to intermediate values computed by the target device which depend on parts (e.g. sub-keys) of the secret key. The set \(\mathcal {K}\) of all candidate sub-keys \(k\) is assumed known and not too large. The secret sub-key targeted is denoted \(\kappa \). The relation is typically achieved in three steps.

2.1 Device Observation

To implement a SCA, an adversary first observes the target device by feeding it with known messages \(m\) in a set \(\mathcal {M}\), while collecting the corresponding leakage traces \(\{o(m)=(o_{1}(m),\ldots ,o_{T}(m))\}\) as vectors representing the evolution of the physical leakage at \(T\) time points. Thus, the adversary first observes \(\mathcal {O}=\{o(m),m\in \mathcal {M}\}\).

2.2 Device Activity Modeling

Then the adversary models the electrical activity of the device through a proxy. A target intermediate value of \(w\) bits manipulated by the device is chosen, and its values are computed for each possible combination of candidate sub-key \(k\) and message \(m\).

Then, for each candidate sub-key \(k\in \mathcal {K}\), the adversary splits the intermediate values into several clusters with similar electrical activity, using a selection function \(L(m,k)=v\in \mathcal {V}\) (typically the Hamming Weight (HW) or Hamming Distance (HD)). For each \(v\in \mathcal {V}\), the groups \(\mathcal {G}_{k}(v)=\{(m,o(m))\in \mathcal {M}\times \mathcal {O}\mid L(m,k)=v\}\) are formed and collected to give a partition \(\mathcal {P}(k)=\{\mathcal {G}_{k}(v),v\in \mathcal {V}\}\) of \(\mathcal {M}\times \mathcal {O}\).
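To make the grouping concrete, here is a minimal sketch (not the authors’ implementation) of the partitioning step for an AES-like target, assuming the selection function \(L(m,k)=HW(Sbox[m\oplus k])\); the `sbox` table and the list of messages are taken as given.

```python
import numpy as np

def hamming_weight(x: int) -> int:
    """HW of an integer, e.g. HW(0b1011) = 3."""
    return bin(x).count("1")

def partition(messages, k, sbox):
    """Build P(k): group trace indices n by v = L(m_n, k) = HW(Sbox[m_n XOR k])."""
    groups = {}
    for n, m in enumerate(messages):
        v = hamming_weight(sbox[m ^ k])
        groups.setdefault(v, []).append(n)
    return groups  # {v: indices of the traces o(m) falling in G_k(v)}
```

Each group \(\mathcal {G}_{k}(v)\) then collects the traces \(o(m)\) whose indices are stored under the key \(v\).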

Note that there are several ways to manipulate the intermediate values. For example, one could work at the word level or at the bit level. For details, see Appendix A.

2.3 Estimation of \(\kappa \)

The final step of a SCA consists in processing the \(\mathcal {P}(k)\) to get an estimate \(\hat{\kappa }\) of \(\kappa \). This is done through a distinguisher. In CPA, the distinguisher is Pearson’s correlation coefficient: at each time point \(t\in \{1,\ldots ,T\}\) and for each candidate sub-key \(k\in \mathcal {K}\), its value \(r_{k}(t)\) for the data in \(\{(L(m,k),o_{t}(m)),m\in \mathcal {M}\}\) is computed. Setting \(R_{k}=\max _{t\in \{1,\ldots ,T\}}r_{k}(t)\), \(\kappa \) is estimated by \(\hat{\kappa }=\arg \max _{k\in \mathcal {K}}R_{k}\). The rationale is that when \(k=\kappa \), the grouping of the traces induced by \(L(\cdot ,k)\) could show a strong enough linear association to allow distinguishing the correct sub-key from incorrect candidates. CPA is most fruitful when the data points \(\left\{ (L(m,\kappa ),o_{t}(m)),m\in \mathcal {M}\right\} \) exhibit a linear trend.
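As an illustration, the following sketch (a simplified rendering, not a reference implementation) computes \(r_{k}(t)\) at all time points at once and returns \(\hat{\kappa }\); here `traces` is an \(N\times T\) array and `lvals[k]` holds the \(N\) values \(L(m,k)\) for candidate \(k\).

```python
import numpy as np

def cpa(traces: np.ndarray, lvals: dict) -> int:
    """Return kappa_hat = arg max_k R_k, with R_k = max_t r_k(t)."""
    best_k, best_R = None, -np.inf
    for k, L in lvals.items():
        Lc = np.asarray(L, float) - np.mean(L)
        Oc = traces - traces.mean(axis=0)
        # Pearson's r at every time point t, via vectorized sample moments.
        r = (Lc @ Oc) / (np.sqrt(Lc @ Lc) * np.sqrt((Oc ** 2).sum(axis=0)))
        if r.max() > best_R:
            best_k, best_R = k, r.max()
    return best_k
```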

In MIA, the MI index is used. In the context considered here, where the random vector \((X,Y)\) is hybrid, that is \(X\) is discrete while \(Y\) is continuous with support \(S_{Y}\), the theoretical version of this index is defined as

$$\begin{aligned} MI=\sum _{x}l(x)\int _{S_{Y}}f(y|x)\log \frac{f(y|x)}{g(y)}\,dy, \end{aligned}$$
(1)

where \(f(y|x)\) is the conditional (on \(X\)) PDF of \(Y\), while \(g(y)\) (resp. \(l(x)\)) is the marginal PDF of \(Y\) (resp. \(X\)), and the symbol \(\sum _{x}\) refers to a sum taken over the values \(x\) of \(X\) such that \(l(x)>0\). We have \(MI \ge 0\), with equality if and only if \(X\) and \(Y\) are statistically independent. There are other equivalent formulas defining the MI index, notably

$$\begin{aligned} MI&=H(Y)-\sum _{x}l(x)H(Y|x)\end{aligned}$$
(2)
$$\begin{aligned}&=H(Y)-H(Y|X), \end{aligned}$$
(3)

where \(H(Y)=-\int _{S_{Y}}g(y)\log g(y)dy\) is the (differential) entropy of random variable \(Y\) and similarly \(H(Y|x)=-\int _{S_{Y}}f(y|x)\log f(y|x)\,dy\).
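The equivalence of (1) and (2) follows by splitting the logarithm in (1) and using the mixture identity \(g(y)=\sum _{x}l(x)f(y|x)\):

$$\begin{aligned} MI&=\sum _{x}l(x)\int _{S_{Y}}f(y|x)\log f(y|x)\,dy-\int _{S_{Y}}\Big (\sum _{x}l(x)f(y|x)\Big )\log g(y)\,dy\\&=-\sum _{x}l(x)H(Y|x)+H(Y), \end{aligned}$$

which is (2); rewriting the sum as the conditional entropy \(H(Y|X)\) then gives (3).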

Specializing formula (3), MIA can be expressed as computing at each time point \(t\in \{1,\ldots ,T\}\) and for each sub-key \(k\in \mathcal {K}\), the quantity

$$\begin{aligned} MI_{k}(t)=H(o_{t}(m))-H(o_{t}(m)|L(m,k)). \end{aligned}$$
(4)

The correct sub-key \(\kappa \) should satisfy

$$\begin{aligned} \kappa =\arg \max _{k\in \mathcal {K}}\left\{ \max _{t\in \{1,\ldots , T\}}MI_{k}(t)\right\} , \end{aligned}$$
(5)

and if \(\widehat{MI_{k}(t)}\) is an estimate of \(MI_{k}(t),\) an estimate \(\hat{\kappa }\) of \(\kappa \) is obtained as

$$\begin{aligned} \hat{\kappa }=\arg \max _{k\in \mathcal {K}}\left\{ \max _{t\in \{1,\ldots , T\}}\widehat{MI_{k}(t)}\right\} . \end{aligned}$$
(6)

The main difficulty in implementing a MIA is in estimating the values \(MI_{k}(t)\).

3 Estimating a PDF

Suppose a sample of independent copies \(\left\{ (X_{n},Y_{n}),n=1,...,N\right\} \) of \((X,Y)\) is available. The problem of estimating the MI index (2) requires estimators of the entropies \(H(Y)\) and \(H(Y\mid x)\), which in turn require estimators of the PDF \(g(y)\) and \(f(y|x)\). As stated earlier, estimation of these underlying PDF is a difficult statistical problem.

In general, a PDF estimator must offer a good trade-off between accuracy (bias) and variability (variance). In this section, we present the KDE. For the interested reader, details about other nonparametric methods (histogram or B-spline) can be found in [4, 14]. Note that, for simplicity, we restrict attention to the case of univariate PDF.

The kernel method uses a function \(K(\cdot )\), referred to as the kernel, in conjunction with a bandwidth \(h>0\). The KDE of \(g(y)\) is then

$$\begin{aligned} \hat{g}_{KDE}(y)=\frac{1}{N}{\displaystyle \sum _{n=1}^{N}K_{h}\left( y-Y_{n}\right) ,} \end{aligned}$$
(7)

where \(K_{h}(y)=h^{-1}K(y/h)\). Regarding the kernel, classical choices are the Gaussian function \(K(y)=\frac{1}{\sqrt{2\pi }}e^{-y^{2}/2}\) or the Epanechnikov function \(K(y)=\frac{3}{4}(1-y^{2})\) for \(|y|\le 1\) (and 0 otherwise), but in general this choice has less impact on the estimator than the bandwidth, which is critical in controlling the trade-off between bias and variance. A huge literature, surveyed in [17], has been devoted to the choice of this tuning parameter, and the expression of an optimal (in an asymptotic and global mathematical sense) bandwidth has been obtained. A relatively good estimator of this optimal bandwidth is given by Silverman’s rule [18], which, for the Epanechnikov kernel, is

$$\begin{aligned} h_{S}=2.34\ {\hat{\sigma }}N^{-1/5}, \end{aligned}$$
(8)

where \(\hat{\sigma }\) is the sample standard deviation of the data \(\left\{ Y_{n},n=1,...,N\right\} \). From (2), \(H(Y)\) can be estimated by

$$\begin{aligned} H_{KDE}(Y)=-\int _{S_{Y}}{\hat{g}}_{KDE}\left( y\right) \log {\hat{g}}_{KDE}(y)\,dy, \end{aligned}$$
(9)

and similarly

$$\begin{aligned} H_{KDE}(Y|x)=-\int _{S_{Y}}{\hat{f}}_{KDE}(y|x)\log {\hat{f}}_{KDE}(y|x)\,dy, \end{aligned}$$
(10)

while \(l(x)\) can be estimated by \(N_{x}/N\), where \(N_{x}=\sum _{n=1}^{N}\mathbb {I}\{X_{n}=x\}\) and \(\mathbb {I}\{A\} = 1\) if event \(A\) is realized and 0 otherwise.
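As a concrete illustration, here is a minimal sketch (under the definitions above, not the authors’ code) of the KDE (7) with the Epanechnikov kernel and Silverman’s bandwidth (8).

```python
import numpy as np

def epanechnikov(u: np.ndarray) -> np.ndarray:
    """K(u) = 0.75 * (1 - u^2) for |u| <= 1, and 0 otherwise."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def silverman_h(sample: np.ndarray) -> float:
    """Formula (8): h_S = 2.34 * sigma_hat * N^(-1/5) (Epanechnikov kernel)."""
    return 2.34 * np.std(sample, ddof=1) * len(sample) ** (-0.2)

def kde(y_eval, sample, h: float) -> np.ndarray:
    """Formula (7): g_hat(y) = (1/N) * sum_n K_h(y - Y_n), K_h(u) = K(u/h)/h."""
    u = (np.asarray(y_eval)[:, None] - np.asarray(sample)[None, :]) / h
    return epanechnikov(u).mean(axis=1) / h
```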

At this stage, another hurdle is encountered because the above computations require integration. To reduce the computational cost, one can choose points \(\mathcal {Q}=\{q_{0}<\ldots <q_{B}\}\) (referred to as query points) and estimate \(H(Y)\) by

$$\begin{aligned} H_{KDE}^{*}(Y)=-\sum _{b=1}^{B}{\hat{g}}_{KDE}(q_{b})\log {\hat{g}}_{KDE}(q_{b})(q_{b}-q_{b-1}), \end{aligned}$$
(11)

and similarly with \(H_{KDE}^{*}(Y|x)\) in place of (10). If there are computational constraints, a small number of query points in \(\mathcal {Q}\) will be preferred, but then they must be properly chosen to preserve the mathematical accuracy of the integral, a problem for which various solutions exist, for example the rectangular method of (11) or more sophisticated quadrature formulas. Accuracy also depends on the number \(B\) of query points and can be made arbitrarily good by increasing \(B\), at the expense of computational cost. We have taken the strategy of choosing these query points systematically, along a grid covering all the sample points, whose coarseness is chosen according to the available computing power.
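Continuing the sketch above, the plug-in entropy estimate (11) and the resulting MI estimate based on (2) can be written as follows; the guard against \(\log 0\) on empty grid cells is our own implementation detail, not part of the formulas.

```python
import numpy as np

def entropy_kde(sample, h, n_query=128):
    """Formula (11): rectangular rule over n_query equispaced query points."""
    q = np.linspace(sample.min(), sample.max(), n_query + 1)
    g = np.clip(kde(q[1:], sample, h), 1e-300, None)  # avoid log(0)
    return -np.sum(g * np.log(g) * np.diff(q))

def mi_hat(x, y, h):
    """Formula (2): H(Y) - sum_x (N_x / N) * H(Y | x)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = entropy_kde(y, h)
    for v in np.unique(x):
        ys = y[x == v]
        mi -= (len(ys) / len(y)) * entropy_kde(ys, h)
    return mi
```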

We stress that Silverman’s rule was developed with the view of getting a globally good estimate of a PDF. There is no guarantee, however, that the formula yields a good bandwidth for the estimation of complex functionals such as the MI index, a problem that requires further theoretical work in mathematical statistics. In the next section, we present our proposal to address this problem in the context of a SCA, where subject-matter information allows another solution.

4 Setting the Tuning Parameters of KDE-MIA: Bandwidth and Query Points

To show the effect of various choices of bandwidth and query points on KDE-MIA, a small simulation study with synthetic data was conducted.

Ten thousand pairs \((HW,L)\) were drawn from the following non-linear leakage model: with probability 0.5, \(L=-0.9017+0.009\cdot HW-0.0022\cdot HW^{2}+\epsilon \), and otherwise \(L=-0.9017+\epsilon \), where \(\epsilon \sim N(0,0.005)\). The values of \(HW\) were independently computed from intermediate values consisting of four independent binary (i.e. with range \(\{0,1\}\)) random variables. We used synthetic data here so that the exact value of the MI index (= 0.0312) could be computed. This leakage model is inspired by the actual EM data considered in Sect. 5.
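For reproducibility, the simulation can be sketched as follows (reading \(N(0,0.005)\) as a standard deviation of 0.005; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
# HW of an intermediate value made of four independent fair bits.
HW = rng.integers(0, 2, size=(N, 4)).sum(axis=1)
eps = rng.normal(0.0, 0.005, size=N)
quad = -0.9017 + 0.009 * HW - 0.0022 * HW ** 2 + eps   # leaky branch
flat = -0.9017 + eps                                    # non-leaky branch
L = np.where(rng.random(N) < 0.5, quad, flat)
# mi_hat(HW, L, h) can now be compared with the exact value 0.0312
# for various bandwidths h, reproducing the behavior of Fig. 1.
```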

Fig. 1. Behavior of the estimator of the MI index as a function of the number of query points, for several bandwidth values \(h\) (\(h_{S}=0.003\)).

Figure 1 shows the results of estimating the MI index as the bandwidth \(h\) and the number of equispaced query points are changed. As expected, Silverman’s rule yields a good estimate of the actual MI index when \(\mathcal {Q}\) contains a reasonable number of points (e.g. \(\ge \)16).

Note also that as the bandwidth increases, the bias of the MI estimator increases (while its variance decreases) and the estimated MI decays to zero. This is explained by the fact that, as \(h\) increases, all KDE get oversmoothed and converge to the same function, which resembles the initial kernel spread over \(S_{Y}\); the entropies then converge to the same value and the MI index vanishes.

All this dovetails nicely with intuition and the admonishments in almost all publications on MIA that, in order to have a good estimator of the MI index, one should use adequate PDF estimators.

However, this does not guarantee maximal efficiency of the MIA. Based on real data, Fig. 2 surprisingly shows that increasing the bandwidth results in better attacks in terms of partial Success Rate (pSR). This behavior was replicated with other data sets and suggests that good PDF estimation does not necessarily translate into efficiency of the attack: larger bandwidths, and thus smoother PDF estimators, seem to yield better results.

Fig. 2. Partial Success Rate on the 1\(^{st}\) Sbox at the last round of the AES, from the publicly available traces of the DPAContestV2 [19], using the HD model at the word level.

It is this counterintuitive behavior that led to the realization that the bandwidth could be seen not as a nuisance parameter to be dealt with in a statistical estimation procedure, but more profitably as a lever that can be used to fine-tune a SCA. Note that such a lever does not exist in standard CPA and arises only with more complex distinguishers.

Our Adaptive Bandwidth Selection (ABS) procedure explicitly exploits the fact that there exists exactly one correct sub-key \(\kappa \). For all other \(k\in \mathcal {K}\), there should be statistical independence between the intermediate value and the leakage, so that \(MI_{k}=0\) when \(k \ne \kappa \) while \(MI_{\kappa }>0\) (for simplicity, we suppress the time point \(t\in \{1,\ldots , T\}\) from the notation because we consider only one leakage point). To eliminate ghost-peak effects, we consider the average distance to the rivals instead of the distance to the second-best rival. Thus, an alternate expression to (5) is

$$\begin{aligned} \kappa =\arg \max _{k\in \mathcal {K}}\left\{ MI_{k}-\overline{MI_{-k}}\right\} , \end{aligned}$$
(12)

where \(\overline{MI_{-k}}\) denotes the mean of all the \(MI\) values except \(MI_{k}\).

Now, using KDE, let \(\widehat{MI_{k}(h)}\) be an estimator of \(MI_{k}\) using the bandwidth \(h\) in all PDF involved in (2). The empirical version of (12) leads to the first estimator

$$\begin{aligned} \hat{\kappa }=\arg \max _{k\in \mathcal {K}}\left\{ \widehat{MI_{k}(h)}-\overline{\widehat{MI_{-k}(h)}}\right\} , \end{aligned}$$
(13)

where \(\overline{\widehat{MI_{-k}(h)}}\) stands for the mean of all estimators except \(\widehat{MI_{k}(h)}\). At this stage, the value of \(h\) is still unspecified. The above suggests choosing this value so as to facilitate the identification of \(\kappa \). But, as noted earlier, when \(h\) increases, all PDF in (2) are oversmoothed, so that all \(\widehat{MI_{k}(h)}\) decay to zero (albeit at a different rate for \(\widehat{MI_{\kappa }(h)}\)). This suggests normalizing expression (13) and leads to the consideration of

$$\begin{aligned} \hat{\kappa }=\arg \max _{k\in \mathcal {K}}\left\{ \max _{h>0}\left[ \frac{\widehat{MI_{k}(h)}-\overline{\widehat{MI_{-k}(h)}}}{\overline{\widehat{MI_{-k}(h)}}}\right] \right\} \end{aligned}$$
(14)

as an estimator of \(\kappa \). The value of \(h\) where the inner max operator is attained will be noted \(h_{ABS}\).

Some computational and statistical comments are in order at this stage. First, even when \(MI_{k}=0\), \(\widehat{MI_{k}(h)}\ge 0\) (i.e. the estimator is upwardly biased), so that the denominator \(\overline{\widehat{MI_{-k}(h)}}\) is almost surely \({>}0\); this eliminates the risk of indeterminacy. Second, the estimator \(\widehat{MI_{\kappa }(h)}\) will tend to be greater than \(\widehat{MI_{k}(h)}\) when \(k \ne \kappa \), in the sense that \(Prob(\widehat{MI_{\kappa }(h)}>\widehat{MI_{k}(h)})\) will be high. Simple algebra shows that, when \(k\ne \kappa \), the term in brackets in (14) should then lie in the interval \([-1,0]\) with high probability, whereas when \(k=\kappa \), this term should tend to be positive, thus allowing a good probability of discrimination for \(\kappa \). The maximization over \(h\) aims at making this discrimination independent of the choice of \(h\); it is an automatic bandwidth selection procedure targeting the goal of getting a good estimate of \(\kappa \), in contrast to Silverman’s rule, which aims at getting good estimates of the PDF involved in \(MI_{k}\). The maximization also has the side effect of smoothing out the quirks that could occur in the individual estimated PDF, and thus in the resulting \(MI_{k}\), with a single value of \(h\). Finally, the smoothness of \(\widehat{MI_{k}(h)}\) as a function of \(h\) allows the max operator to be evaluated over a finite grid (to avoid trivial problems) of properly chosen \(h\) values, ranging from some point in the neighborhood of \(h_{S}\) to some large multiple of this value; this accelerates the computation of \(\hat{\kappa }\). In practice, (14) is implemented as

$$\begin{aligned} \hat{\kappa }=\arg \max _{k\in \mathcal {K}}\left\{ \max _{h\in \mathcal {I}}\left[ \frac{\widehat{MI_{k}(h)}-\overline{\widehat{MI_{-k}(h)}}}{\overline{\widehat{MI_{-k}(h)}}}\right] \right\} , \end{aligned}$$
(15)

where \(\mathcal {I} = \{h_i\}_{1\le i\le H}\) is a set of \(H \ge 2\) bandwidths.
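A minimal sketch of (15), reusing the `mi_hat` helper sketched above (the bandwidth grid is an assumption left to the user, e.g. multiples of \(h_{S}\)):

```python
import numpy as np

def normalized_margin(x_by_key, k, y, h):
    """Term in brackets of (14) for candidate k at bandwidth h."""
    mi = {kk: mi_hat(xx, y, h) for kk, xx in x_by_key.items()}
    rivals = np.mean([v for kk, v in mi.items() if kk != k])  # mean MI_{-k}
    return (mi[k] - rivals) / rivals

def abs_mia(x_by_key, y, h_grid):
    """kappa_hat of (15): maximize over the bandwidth grid I, then over k."""
    return max(x_by_key,
               key=lambda k: max(normalized_margin(x_by_key, k, y, h)
                                 for h in h_grid))
```

In practice one would compute \(\widehat{MI_{k}(h)}\) once per \((k,h)\) pair instead of rebuilding the full dictionary for every candidate, but the sketch keeps the correspondence with (15) explicit.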

From an engineering point of view, the action of (14) (via the value of \(h\)) can be seen as a focus adjustment for visualizing a set of \(K\) pixels (\(K=256\) in the case of AES, the MI index being associated with each of the 256 key hypotheses). The numerator highlights a single pixel (a single key guess), while the denominator makes the background of the picture uniform, i.e. standardizes the estimated MI values associated with the remaining guesses.

To get some feeling for the behavior of our approach, we illustrate its action on real data. We consider the 1\(^{st}\) Sbox at the last round of the AES from the publicly available traces of the DPAContestV2 [19], with the HD model at the word level. It turns out that \(h_{ABS}=1.8>h_{S} = 0.17\) (volts), so that our PDF estimators are smoother.

Fig. 3. Values of the term in brackets in (14) for \(h_{S}\) (top panel) and \(h_{ABS}\) (bottom panel) for the 256 key guesses, after processing all DPAContestV2 traces with the HD model at the word level.

Fig. 4. Evolution of the relative margins (%) for \(h_{S}\) and \(h_{ABS}\) with the number of processed DPAContestV2 traces (HD model at the word level).

Figure 3 shows the action of our ABS criterion. The top panel gives the term in brackets of (14) for the 256 key guesses using \(h_{S}\); the bottom panel shows the same with \(h_{ABS}\). In both cases, the correct sub-key value (\(\kappa =83\)) is disclosed by MIA after processing all traces. However, for \(h_{S}\) the margin over the second-best sub-key guess is relatively small, whereas it is much larger using \(h_{ABS}\). Thus, the maximizing step over \(h\) reduces the impact of ghost peaks and allows a better discrimination of the correct sub-key.

Figure 4 presents another view supporting this behavior. It reports the relative margin (%) of the best (correct) sub-key with respect to the second-best (wrong) sub-key guess, i.e. the difference between the estimated MI for the correct sub-key and the highest MI value among wrong sub-keys, at each step of the attack, for \(h_{ABS}\) and \(h_{S}\). Again, the approach based on \(h_{ABS}\) is more effective at reducing ghost peaks.

We close this section by noting that the principle embodied in (14) is consonant with the idea mentioned in [13], which suggests detecting the outlier behavior of the correct key to perform successful recoveries. Also, when analysing a set of traces over many time points \(t\in \{1,\ldots , T\}\), the \(\max _{t\in \{1,\ldots , T\}}\) operation in (15) should be computed after \(\max _{h\in \mathcal {I}}\) (to optimize the extraction of information at each leakage point), with the result being the operand of \(\arg \max _{k\in \mathcal {K}}\).
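Under the ordering just described, a hedged sketch of the multi-point attack, reusing `normalized_margin` above, reads:

```python
def abs_mia_multi_t(x_by_key, traces, h_grid):
    """Per time point: max over h; then max over t; finally arg max over k."""
    scores = {k: -float("inf") for k in x_by_key}
    for t in range(traces.shape[1]):
        y = traces[:, t]
        for k in x_by_key:
            s = max(normalized_margin(x_by_key, k, y, h) for h in h_grid)
            scores[k] = max(scores[k], s)
    return max(scores, key=scores.get)
```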

5 Experimental Results

In this section, we further compare the performance of our ABS-MIA to the MIA using \(h_{S}\), referred to as S-MIA. During this evaluation, CPA was also computed and used as a benchmark. Three main criteria were considered:

  1. Efficiency, as measured by the number of traces needed to reach a target success rate [20].

  2. Genericity, the ability of a SCA to remain more or less successful under an unknown leakage model.

  3. Computational burden.

Fig. 5. Partial Success Rates (pSR) evaluated on the 1\(^{st}\) Sbox at the last round of the AES with the DPAContestV2 traces over two scenarios: ‘mb’ (top) and ‘wd’ (bottom). The HD model was considered.

Comparisons were conducted according to two scenarios:

  1. Bit level (multi-bit).

  2. Word level.

To distinguish between these scenarios, ‘mb’ and ‘wd’ suffixes are used in the remainder of the paper (see Appendix A for details).

Fig. 6. Partial Success Rates (pSR) evaluated on the 4\(^{th}\) Sbox at the last round of the AES with our EM traces over two scenarios: ‘mb’ (left top) and ‘wd’ (left bottom). The HD model was considered.

5.1 ABS-MIA Efficiency

The attacks were conducted with the traces of the DPAContestV2 [19] at the same fixed time point chosen in Sect. 4. Again, we focused on the 1\(^{st}\) Sbox at the last round of the AES. We used both the Gaussian and the Epanechnikov kernels but report only on the latter, as both give very similar results. For the estimation of the MI index, a grid of 128 equidistant query points was taken to cover the peak-to-peak amplitude of the traces, fixed by the choice of caliber and sensitivity of the analog-to-digital converter of the oscilloscope (8-bit resolution) during the measurements. Efficiency was measured by the Success Rate (SR), following the framework of [20]. This metric was averaged over 50 independent attacks to obtain an average partial Success Rate (pSR). The attacks were conducted with the HD model. Figure 5 illustrates the promising features of our approach. In all scenarios, ABS-MIA requires a smaller number of measurements than S-MIA, demonstrating the improvement. More importantly, we observe that ABS-MIAmb compares favorably with the very efficient CPAwd. To sustain these results, we carried out CPA, ABS-MIA and S-MIA on a different data set of 10000 EM traces collected above a hardware AES block mapped into an FPGA operating at 50 MHz, with an RF2 probe and a 48 dB low-noise amplifier. We concentrated on the 4\(^{th}\) Sbox at the last round. The HD linear model was once again considered. Figure 6 shows results similar to those obtained with the DPAContestV2 data above, with ABS-MIA showing again a large improvement over S-MIA while staying competitive with CPA.

5.2 ABS-MIA Genericity

To investigate genericity, the evaluations were performed using the second set of traces of the previous section under the (unknown) HW leakage model. As the pSR of the attacks using the ‘wd’ scenario never reached 10 % after processing the 10000 traces, we excluded it from further consideration. Interestingly, ABS-MIAmb is the only successful HW-based attack, with a pSR of 80 % after processing 7400 traces (see Fig. 7). Notably, all the variants of CPA (i.e. CPAmb and CPAwd) fail in this case.

Fig. 7. Partial Success Rates (pSR) evaluated on the 4\(^{th}\) Sbox at the last round of the AES for the HW ‘mb’ model.

5.3 ABS-MIA Computational Burden

Regarding runtime, the computational cost of a MIA is related to the number of entropies to be computed (17 for ‘mb’ and 10 for ‘wd’) and to the parameters used to compute each entropy (number of query points, choice of bandwidth). Recall that ABS-MIA is a two-stage procedure: an additional preprocessing step to obtain \(h_{ABS}\) is required before launching the attack. To save time, we emphasize that this profiling step can be performed on a representative subset of the traces to compute an approximation of \(h_{ABS}\), because the term in brackets in (15) stabilizes quickly as the number of traces increases. This strategy was used with the ‘mb’ and ‘wd’ scenarios to perform the ABS-MIA of Sect. 5.4. The time spent on this preprocessing for each Sbox is approximately one twentieth of the time required for S-MIA. Moreover, this time is partly recovered by the reduction in the number of query points required for good behavior of ABS-MIA (16, compared to at least 96 for S-MIA): because the PDF are smoother in ABS-MIA, the integrals in (9) and (10) are more easily approximated. This significantly reduces the number of computations involved in getting \(\widehat{MI_{k}(h)}\).

5.4 ABS-MIA: Global Success Rate for the DPAContestV2

Finally, we applied S-MIA and ABS-MIA to the DPAContestV2 traces and considered the global Success Rate (gSR) using the HD model. We also launched CPAwd and CPAmb as benchmarks. As in Sect. 5.1, 50 trace orders were considered. The evolution of the gSR is shown in Fig. 8. We observe that ABS-MIAmb dominates, requiring in particular 15200 traces for the gSR to stabilize above \(80\,\%\). On the other hand, S-MIA fails to recover the key. Thirty minutes (resp. two hours) were necessary to complete both the preprocessing and the ABS-MIAwd (resp. ABS-MIAmb) on a personal computer.

6 Conclusions

MIA was motivated by its ability to capture all structures of dependence between leakage and intermediate values. But the cost of this attractive feature is the difficulty of adequately choosing some tuning parameters. By focusing on the goal of optimizing the KDE-based MIA instead of the auxiliary task of estimating PDF, we have obtained an efficient bandwidth selection procedure. The resulting bandwidths are usually larger than the commonly used \(h_S\) (obtained by Silverman’s rule) and give better results in terms of attack efficiency across various experiments. We have shown that MIA driven by this method is comparable to CPA [2]. Additionally, we have reported that our MIA can succeed when CPA fails (see Sect. 5.2). Our approach could be applied to select the tuning parameters in other SCA involving nonparametric estimators, namely histograms and splines. We feel the present work shows that there can be benefits in adapting the principles of statistical methods to the task at hand: SCA in the present case.

Fig. 8. Global Success Rates (gSR) evaluated at the last round of the AES on the available traces of the DPAContestV2. The HD model was considered.