Journal of Hardware and Systems Security, Volume 2, Issue 1, pp 69–82

Power Analysis Attack of an AES GPU Implementation

  • Chao Luo
  • Yunsi Fei
  • Liwei Zhang
  • A. Adam Ding
  • Pei Luo
  • Saoni Mukherjee
  • David Kaeli


In the past, Graphics Processing Units (GPUs) were mainly used for graphics rendering. Over the past 10 years, they have been redesigned to accelerate a wide range of applications, including deep neural networks, image reconstruction, and cryptographic algorithms. Despite being the accelerator of choice in a number of important application domains, today's GPUs have received little attention regarding their security, especially their vulnerability to realistic and practical threats such as side-channel attacks. In this work, we present our study of side-channel vulnerability targeting a general-purpose GPU. We propose and implement a side-channel power analysis methodology to extract all the last round key bytes of an AES (Advanced Encryption Standard) implementation on an NVIDIA TESLA GPU. We first analyze the challenges of capturing GPU power traces, due to the degree of concurrency and the underlying architectural features of a GPU, and propose techniques to overcome these challenges. We then construct an appropriate power model for the GPU. We describe effective methods to process the GPU power traces and launch a correlation power analysis (CPA) attack on the processed data. We carefully consider the scalability of the attack with increasing degrees of parallelism, a key challenge on the GPU. Both our empirical and theoretical results show that parallel computing hardware systems such as GPUs are vulnerable to power analysis side-channel attacks and need to be hardened against such threats.


Keywords: Side-channel attack · Correlation power analysis · AES · GPGPU

1 Introduction

Graphics Processing Units (GPUs), originally designed for 3-D graphics rendering, have evolved into high-performance general-purpose processors. Today, a GPU can provide significant performance advantages over traditional multi-core CPUs by executing workloads in parallel on hundreds to thousands of cores. This development has been spurred by the delivery of programmable shader cores and high-level programming languages [7], including CUDA and OpenCL. Since then, GPUs have been used to accelerate a wide range of applications [11], including signal processing, circuit simulation, molecular modeling, and machine learning.

Motivated by the demands of efficient cryptographic computation over large amounts of data, GPUs are now being leveraged to accelerate a number of cryptographic algorithms. Before the introduction of CUDA and OpenCL, Cook et al. [3, 4] made the first efforts to map an AES cipher to a fixed graphics pipeline using OpenGL. Using CUDA, Manavski [21] implemented AES on an NVIDIA G80 GPU, achieving a speedup as high as 5.9 times over the fastest CPU at the time. Iwai et al. achieved a throughput of approximately 35 Gbps (Gigabits per second) on an NVIDIA GeForce GTX285 [12]. Li et al. [17] achieved the highest performance, around 60 Gbps throughput on an NVIDIA Tesla C2050 GPU, which runs up to 50 times faster than an Intel Core i7-920. More recent work accelerated asymmetric ciphers by exploiting the power of GPUs [31]. Gilger et al. [10] implemented multiple block ciphers, in both CUDA and OpenCL, providing an OpenSSL cryptographic engine that can easily accelerate common ciphers and thus reduces development effort.

While the focus of prior work has been on accelerating cryptographic implementations by leveraging a GPU's computational power, there is little prior work that addresses the security of execution on a GPU. Di Pietro et al. [30] demonstrated that leakage of information can occur in a GPU's shared memory, global memory, and registers by using standard CUDA instructions. Maurice et al. [24] recovered data of a previously executed GPU application in a virtualized environment. Lombardi et al. [18] described how GPU-as-a-Service in the Cloud can be misused, leading to denial-of-service attacks and information leakage. However, side-channel vulnerabilities of GPUs have received limited attention in the research community. Meanwhile, cryptographic systems based on other platforms, including microcontrollers [26], smart cards [25], application-specific integrated circuits (ASICs) [28] and FPGA platforms [20, 29], have all been shown to be highly vulnerable to side-channel attacks. We are the first to conduct research on side-channel analysis of GPUs. In our prior work [19], we presented the first power analysis of AES on a GPU, demonstrating the feasibility of an attack. Our group also launched the first timing attack on AES running on a GPU [14].

Distinct from other computational platforms, the Single Instruction Multiple Thread (SIMT) model used on a GPU presents a range of challenges to side-channel analysis. During execution, each thread can be in a different phase of execution, generating some degree of randomness (i.e., timing uncertainties and misalignment of power traces). In addition, the complexity of the GPU hardware system makes it rather difficult to obtain clean and synchronized power traces, and the power consumption model is complicated. To address these challenges, we develop effective methods to obtain clean power traces, and build a suitable side-channel power leakage model to guide a successful power analysis attack. Our correlation power analysis (CPA) attack [19] demonstrates that AES-128 developed in CUDA on an NVIDIA C2070 GPU is susceptible to power analysis attacks.

Our prior successful power analysis attack on a GPU [19] was implemented in a highly controlled environment, where many threads are employed for computation, but they repeatedly work on the same blocks of plaintext data. Our results [19] mark an important step forward, demonstrating the feasibility of key recovery on a GPU. However, we acknowledge that the controlled attack environment in our prior work helped to limit the random noise and amplify the side-channel power signal, thereby increasing the signal-to-noise ratio (SNR) and making the power analysis attack effective.

In this work, we investigate the robustness of our side-channel power attack by utilizing different numbers of blocks of plaintext and increasing the degree of concurrency, further demonstrating the vulnerability of GPUs used for cryptographic computation in a more realistic setting. We extend our prior work significantly in both theory and experiments. We analyze multiple select functions, identify the best one for the attack, and provide quantitative analysis of why the chosen select function is successful. We analyze the scalability of the attack and explore how the size of the data, which translates to executing many parallel threads, impacts the attack success rate. We present evaluation results for attacks in much more realistic GPU execution scenarios, clearly demonstrating that the GPU incurs significant side-channel power leakage.

The novel contributions of this work include the following:
  • We present a detailed analysis of power leakage and construct our power model: the power leakage is decomposed into three parts, two of which are nearly linear in the key byte, while the third is non-linear. By choosing different power models, we analyze the attack effectiveness (success rate) of CPA.

  • We revisit the success rate model proposed by Fei [6] for CPA attacks on CPUs, and extend it to a parallel computing environment on a GPU, producing accurate predictions on the number of power traces needed to achieve a desired attack success rate.

  • We launch a large number of attacks while varying the degree of parallelism, examining the scalability of the attack. Both empirical and theoretical success rates are presented and analyzed. We examine how the degree of parallelism affects the effectiveness of our attack, and evaluate the success rate when the full capacity of the GPU is exploited in an AES implementation.

The rest of this paper is organized as follows. Section 2 provides background on CUDA and GPU architecture, the AES cipher, and side-channel attacks, and introduces the attack model. In Section 3, we describe our experimental setup for acquiring power traces. We build the GPU's power leakage model in Section 4. In Section 5, we present the attack results of extracting the last round key, present our extended success rate model, and employ it to quantify the effectiveness of the attack. In Section 6, we suggest countermeasures to mitigate such attacks. Finally, we conclude the paper in Section 7.

2 Preliminaries

In this section we begin by describing the GPU hardware architecture and associated software programming model. Then we review the specific AES cipher considered in this work and its implementation in CUDA on an NVIDIA GPU. We also review the basics of correlation power analysis attacks, followed by the attack model we use in this work.

2.1 GPU Basics

CUDA is a parallel computing platform developed by NVIDIA, and it is also an application programming interface for their GPUs [27]. CUDA source code is divided into two components, host code and device code. The host code runs on the CPU (typically C/C++ code), and the device code is executed on the GPU utilizing a number of parallel threads. A group of threads runs the same kernel but processes different data. Threads are organized into blocks, and blocks into a grid, as shown in Fig. 1. A thread is indexed by both its block id and thread id, which can be used to specify the data it works on. Thread scheduling is managed by the GPU according to the availability of hardware resources and the degree of parallelism inherent in the kernel.
Fig. 1

Typical CUDA threads and blocks present in a single grid [16]
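As a sketch, this indexing scheme can be modeled in a few lines of Python. This is a toy model of CUDA's canonical 1-D mapping (blockIdx.x * blockDim.x + threadIdx.x), not actual device code:

```python
def global_thread_id(block_id, block_dim, thread_id):
    # Mirrors CUDA's 1-D convention: blockIdx.x * blockDim.x + threadIdx.x.
    # The resulting index tells each thread which data element to work on.
    return block_id * block_dim + thread_id

# A grid of 4 blocks with 256 threads each covers 1024 data elements,
# each thread selecting its own element through this index.
ids = [global_thread_id(b, 256, t) for b in range(4) for t in range(256)]
```

In real device code the same expression appears inside the kernel, so every thread computes its own index without any coordination.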

In terms of hardware structure, a GPU consists of several Streaming Multiprocessors (SMs). Each SM works as a complete, independent processor, having its own register file, local cache, and control unit. Within an SM there are many Streaming Processors, or CUDA cores, the main computation units on which threads run in parallel. The SM also includes additional hardware resources, including a warp scheduler and dispatch unit, which are needed to control the flow of instructions. The configuration of SMs and CUDA cores varies across GPU models. While our work here specifically targets an NVIDIA TESLA C2070, which has 14 SMs and 448 CUDA cores (32 per SM), our attacks can succeed on a large number of different GPUs. The structure of one SM of a TESLA GPU is shown in Fig. 2.
Fig. 2

Block diagram of a TESLA C2070 streaming multiprocessor [16]

During execution, thread blocks are dispatched to the SMs such that the threads in the same block can communicate through local shared memory. Within one thread block, every 32 threads are grouped into a warp, the smallest schedulable program unit. Threads in one warp run in lockstep. If an executing warp stalls due to data dependencies or control hazards, the warp scheduler dispatches another warp in order to make the best use of hardware resources.

In summary, the host code prepares the data and sets up the runtime environment for the kernel to run on the GPU. The programmer writes the device code, which explicitly divides the job into blocks and threads. The GPU scheduler decides when and where the blocks and threads run in parallel, based on the available hardware resources, the data dependencies present in the code, and the presence of any memory conflicts. In a well-designed CUDA program, data dependencies and memory conflicts are minimized, so the GPU can use all of its available resources and maximize parallel kernel execution.

2.2 AES and a CUDA Implementation of AES

AES [5] is a block cipher algorithm announced as the encryption standard by the National Institute of Standards and Technology in 2001. It is a symmetric-key cipher operating on fixed-size blocks of data. AES consists of a variable number of rounds, depending on the key length: for key sizes of 128, 192, and 256 bits, AES has 10, 12, and 14 rounds, respectively. For AES-128, one block of data is organized as a 4x4 array of bytes, termed the state. Each round is a sequence of four operations (SubByte, ShiftRow, MixColumn, and AddRoundKey), except for the initial and last rounds: the initial round has only an AddRoundKey, and the last round omits the MixColumn. All the round keys are derived from a single initial key by the key schedule.

In this paper, we implement ECB (Electronic Code Book) mode AES-128 encryption as a CUDA kernel based on the reference implementation by Margara [23]. The T-table version of AES [5] is adopted, which is more efficient than the original byte-based SBox version; our analysis is also applicable to other implementations with minor modifications. The three operations SubByte, ShiftRow, and MixColumn are folded into T-table lookups and XOR operations. Each thread is responsible for computing one column of the 16-byte AES state, so 4 threads are needed to manage one whole block of data, as shown in Fig. 3. Note that the aforementioned GPU thread block is different from the 16-byte AES data block, which is iteratively updated in each round, transforming the plaintext input into the ciphertext output.
Fig. 3

The round operation running as one thread

Figure 3 shows the round operations for one column running as a single thread. The initial round is simply an XOR of the plaintext and the first round key. There are nine middle rounds for 128-bit AES encryption. Each thread takes one diagonal of the state as its round input and maps each byte into a 4-byte word through a T-table lookup. These four 4-byte words are XORed together with the corresponding 4-byte round key word, and the result is stored in a column of the output state. The last round has no MixColumn operation, so only one out of four bytes is kept after the T-table lookup, making it equivalent to an SBox lookup plus ShiftRow. AddRoundKey is then performed on the four remaining bytes.
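The last-round byte path described above can be sketched in Python. This is our own illustrative sketch: the S-box is generated from the GF(2^8) multiplicative inverse and affine transform rather than hard-coded, and the input values are arbitrary examples:

```python
def xtime(b):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    """GF(2^8) multiplication (schoolbook shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = [0] * 256
    for x in range(256):
        inv = next(y for y in range(256) if gmul(x, y) == 1) if x else 0
        s = inv
        for _ in range(4):
            inv = ((inv << 1) | (inv >> 7)) & 0xFF   # rotate left by 1
            s ^= inv                                 # x ^ rotl(x,1..4)
        sbox[x] = s ^ 0x63
    return sbox

SBOX = build_sbox()

def last_round_byte(s_in, k1):
    # The 4-byte T-table entry holds 2*S[x], 3*S[x], S[x], S[x] in some
    # byte order; the last round ANDs three bytes away, keeping only S[x],
    # which makes the lookup equivalent to a plain S-box lookup.
    s_out = SBOX[s_in]
    return s_out ^ k1        # AddRoundKey on the surviving byte
```

ShiftRow only permutes byte positions, so it does not appear in this per-byte view.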

To begin an AES encryption, the plaintext is first copied into the GPU global memory. Each thread will load its own data into local memory, based on its block id and thread id. After encryption is complete, the ciphertext in local memory is copied back into global memory, and then copied into CPU memory. In ECB mode, the encryption of each block of data is independent, and thus can be parallelized as much as possible, depending only on the size of the data and the available GPU resources.

2.3 Side-Channel Attacks and Correlation Power Analysis

A side-channel attack is an attack based on information gained from the physical implementation of a cryptosystem. Side-channel information includes power consumption, electromagnetic emanation, timing information, and even sound [8]. Because the leaked information depends on the secret key, an attacker can use correlation to recover the key with a complexity lower than brute force. The attack can be as simple as Simple Power Analysis (SPA) [15], which uses only a single power trace and reads the key bits directly by inspecting the temporal power variation. It can also be as sophisticated as Mutual Information Analysis [9], which is based on information theory.

In this paper, we use the Correlation Power Analysis (CPA) method to extract the keys of AES-128. It is based on the correlation between the observed power information generated by the hardware and a power estimation calculated from a power model (which is a function of the key). To calculate the correlation, the attacker runs the cipher multiple times with different input plaintexts, and each run generates a power trace. For a block cipher, processing each byte is independent of the others, so the attack can be conducted in a divide-and-conquer manner, retrieving the subkey bytes one by one. The power model estimates the deterministic part of the power consumption, e.g., the Hamming distance model for CMOS technology [1], which computes the number of logic changes (i.e., 0-to-1 or 1-to-0) based on the known plaintext (or ciphertext) and a guessed subkey byte value. Given a large enough number of power traces, we compute the Pearson correlation coefficient between the trace data and the modeled data for each guessed subkey value. If the subkey guess is right, the calculated correlation tends to be higher than when the subkey guess is wrong, so for each subkey byte, one out of 256 possible values is identified. For AES-128 with 16 key bytes, the entire search requires only 4096 (= 2^8 × 16) guesses, far below the 2^128 complexity of a brute-force attack.
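The divide-and-conquer CPA procedure can be sketched in Python. This is a toy simulation with a simplified select function HW(p ⊕ k) and synthetic traces, not the GPU power model developed later in the paper; the leakage scale (0.5) and noise level are arbitrary assumptions:

```python
import random

def hw(x):
    """Hamming weight of a byte."""
    return bin(x).count("1")

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def cpa_recover(plaintexts, traces):
    """Return the subkey guess whose model best correlates with the traces."""
    return max(range(256),
               key=lambda k: pearson([hw(p ^ k) for p in plaintexts], traces))

rng = random.Random(0)
true_k = 0x2B
pts = [rng.randrange(256) for _ in range(2000)]
# Synthetic per-encryption power: a unit leakage times the Hamming weight
# plus Gaussian noise. (In the real measurement, lower voltage means higher
# power, so the sign of the winning correlation flips.)
traces = [0.5 * hw(p ^ true_k) + rng.gauss(0.0, 1.0) for p in pts]
recovered = cpa_recover(pts, traces)
```

With 2000 noisy traces, the correct guess produces a markedly higher correlation than all 255 wrong guesses, illustrating why the search cost is 256 per byte rather than exponential in the key length.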

2.4 Attack Model

For the attack model, we make the following assumptions:
  • We assume the attacker knows the encrypted ciphertext. For part of our analysis, we assume knowledge of the plaintext, where the size and value of the input message can be controlled.

  • The attacker can obtain the power consumption of the GPU for each encryption. Power traces can be obtained via measurement or on-chip power sensors, locally or remotely.

3 Experimental Setup and Power Trace Acquisition

In our experiments, we consider a client-server computing platform, where a GPU is used to accelerate AES encryption. The TESLA C2070 GPU is hosted on the PCIe interface of a workstation running Ubuntu. Figure 4 shows our system. To measure the power consumption of the GPU card, a 0.1 Ω resistor is inserted in series with the ATX 12 V GPU power supply. To minimize intrusion into the GPU card, the other parts of the board are left untouched. Since the ATX supply output connected to one end of the resistor is almost constant at 12 V, we only need to measure the voltage at the other end (using an oscilloscope) to obtain the voltage drop across the resistor. The attacker sends plaintext to the server. Upon receiving the data file, the server copies it to the GPU memory for encryption. The ciphertext is generated on the GPU and then returned to the attacker.
Fig. 4

The power measurement setup used in this work

During encryption, the oscilloscope records the power consumption for the attacker at a sampling frequency of 5 GHz, while the GPU's processor clock frequency is 1.15 GHz. When the GPU is idle, it consumes little power, and the voltage the oscilloscope measures is close to the supply voltage, 12 V. As the AES encryption starts, more power is drawn by the GPU, so the voltage drops. After encryption completes, the voltage returns to its original level. Figure 5 shows a sample power trace for our GPU running AES, with the 12 V DC signal subtracted. We found that the voltage falls much more slowly than it rises. This may be because the GPU gradually loads data into memory as the encryption starts, but finishes all of its work in parallel at the end. The power trace is also very noisy, and there appears to be no regular pattern corresponding to the AES round iterations.
Fig. 5

A sample power trace of our GPU running AES, with the DC signal subtracted

Power trace acquisition on a GPU is performed very differently from the approaches used on MCUs, FPGAs, and ASICs [20, 26, 28]. First, the power trace of a GPU contains much more noise, for several reasons. Our measurement point on the ATX power supply is far away from the GPU silicon die's power supply. On the GPU card, many DC-DC units convert the 12 V supply into the various other voltages needed by the GPU, introducing switching noise that we filter out with large capacitors. The measured total power consumption of the GPU card also includes the power consumed by the cooling fan, off-chip memory, PCIe interface, and many other auxiliary circuits. These unrelated sources of power consumption further contribute to the noise.

Second, since there is no GPIO (General-Purpose Input/Output), or a dedicated pin, on the GPU to provide a precise trigger signal to indicate the start or end of the encryption, the oscilloscope takes the rising edge of the power trace as the trigger signal. Because a power trace can be very noisy and its rising and falling edges are not as clean and uniform as we desire, it is challenging to consistently identify the beginning of an encryption in a trace. Therefore, the traces for different encryptions are not synchronized.

The last and most important issue is that the parallel computing behavior of the GPU may cause timing uncertainty in the power traces. The GPU scheduler may switch one warp out and bring in another at any time, and this behavior is not under the programmer’s control. Moreover, there are multiple streaming multiprocessors, each performing encryption concurrently and independently. These facts all pose significant challenges for GPU side-channel power analysis.

4 Power Model Building

Next, we build the power leakage model of the GPU for CPA, which will provide us with the power estimation formula PE(k), where k is the key candidate. The correlation between PE(k) and the actual power traces is then used to find the secret key.

4.1 Hamming Distance Based Power Leakage Extraction

The principle of a side-channel power analysis attack is that the power consumption of a cryptosystem is determined by key-dependent internal state switching. The power consumption of a CMOS circuit consists of static and dynamic power [13]. Static power persists as long as the circuit is powered on, due to leakage through reverse-biased pn junctions. Static power dissipation depends mainly on the temperature and operating voltage, and much less on the internal data; hence it does not vary much and is treated as noise in the power model. Dynamic power is due to the switching of voltages at circuit gate outputs (intermediate states). One part of this power charges and discharges the parasitic capacitance. Another part is consumed by the momentary short circuit formed through the PMOS and NMOS transistors as the output voltage changes. In a simplified model, the magnitude of the dynamic power consumption is linear in the number of changing bits (i.e., the Hamming distance) of the intermediate state. Next, we find the intermediate states of our AES GPU implementation that depend on the secret key and derive their Hamming distances.

If any round key is retrieved, we can deduce the secret 128-bit AES key by reversing the key schedule [5]. Hence, we focus on finding leakage for each subkey byte of the last round. By disassembling the CUDA code, we find the related instructions are:
$$\begin{array}{lllll} &\text{LOAD} \qquad R_{n} \qquad [R_{n}] \\ &\text{AND} \qquad \ \, R_{n} \qquad R_{n} \qquad \text{0x000000FF}\\ &\text{XOR} \qquad \ \ R_{m} \qquad R_{m} \qquad R_{n} \end{array}$$
Figure 6 shows the corresponding operations on resources (registers). The GPU uses a 4-byte register R_n to hold one byte of the last round input state s_in (i.e., the three most significant bytes of R_n are zero). Then the GPU loads the 4-byte T-table contents T_a(s_in) into the same register. Because there is no MixColumn operation in the last round, only one byte in the register is needed. Hence, the other three bytes are ANDed with zeros, and the result s_out remains in the register. The value s_out is in fact the SubByte and ShiftRow output corresponding to the input s_in. Then s_out in register R_n is XORed with its corresponding last round subkey byte k_1 in register R_m to get the final cipher byte c_1 (also in R_m).
Fig. 6

Last round operation on registers for one state byte

These three instructions involve two registers, and result in three transitions at three clock edges in the registers. Hence, the three Hamming distances (two on register R_n and one on register R_m) can be determined as follows:
$$\begin{array}{@{}rcl@{}} h_{1}&=&\text{HW}((0,0,0,s_{in}) \oplus T_{a}) = \text{HW}(2\otimes s_{out})+\\ &&\text{HW}(3\otimes s_{out})+ \text{HW}(s_{out}) + \text{HW}(s_{in} \oplus s_{out}), \end{array} $$
$$\begin{array}{@{}rcl@{}} h_{2}&=&\text{HW}(T_{a} \oplus (0,0,0,s_{out})) = \\ &&\text{HW}(2\otimes s_{out})+\text{HW}(3\otimes s_{out})+ \text{HW}(s_{out}), \end{array} $$
$$\begin{array}{@{}rcl@{}} h_{3}&=&\text{HW}(k_{1} \oplus c_{1})=\text{HW}(s_{out}). \end{array} $$
where ⊕ is XOR, and ⊗ denotes the multiplication in field GF(28).

Since the attacker knows the cipher byte c_1, he/she can calculate these Hamming distances from c_1 and a guessed subkey byte k_1, in reverse order. First, s_out = k_1 ⊕ c_1, which depends linearly on the subkey byte value. Then s_in can be recovered by looking up the inverse of the SBox table, which is non-linear in the key. T_a is the 4-byte T-table value indexed by s_in, which consists of the four components 2 ⊗ s_out, 3 ⊗ s_out, s_out, s_out, ordered according to the byte position of s_in. All three Hamming distances depend on the guessed key value.
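As a sketch, the three Hamming distances can be computed in reverse from a ciphertext byte and a key guess, mirroring Eqs. 1-3. This is our own illustrative Python, with the S-box generated on the fly rather than taken from the actual CUDA binary:

```python
def xtime(b):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    """GF(2^8) multiplication (schoolbook shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = [0] * 256
    for x in range(256):
        inv = next(y for y in range(256) if gmul(x, y) == 1) if x else 0
        s = inv
        for _ in range(4):
            inv = ((inv << 1) | (inv >> 7)) & 0xFF   # rotate left by 1
            s ^= inv
        sbox[x] = s ^ 0x63
    return sbox

SBOX = build_sbox()
INV_SBOX = {s: x for x, s in enumerate(SBOX)}

def hw(x):
    return bin(x).count("1")

def hamming_distances(c1, k1):
    """h1, h2, h3 of Eqs. 1-3, reconstructed from ciphertext byte c1 and a
    guessed last-round subkey byte k1."""
    s_out = c1 ^ k1                # undo AddRoundKey
    s_in = INV_SBOX[s_out]         # invert SubByte (ShiftRow only moves bytes)
    base = hw(gmul(2, s_out)) + hw(gmul(3, s_out)) + hw(s_out)
    h1 = base + hw(s_in ^ s_out)   # loading T_a over (0,0,0,s_in)
    h2 = base                      # ANDing T_a down to (0,0,0,s_out)
    h3 = hw(s_out)                 # XOR with the key byte in R_m
    return h1, h2, h3
```

Note that h1 and h2 differ only by HW(s_in ⊕ s_out), the non-linear term, while h2 and h3 are built entirely from the nearly linear terms discussed in the next subsection.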

4.2 GPU’s Power Leakage Model

In previous work targeting CPA on FPGAs and CPUs, the attacker usually uses a power model based on a nonlinear Hamming distance, such as h_s = HW(s_in ⊕ s_out). Such a power model is very effective when the power traces can be aligned and the h_s-specific operation occurs at a fixed time t in all the power traces: the attacker only needs to analyze the power values at one time point. However, on our parallel GPU platform, there are many concurrent threads running. Of the three Hamming distances in Eqs. 1, 2, and 3 for each thread, the first is nonlinear, while the other two are linear in terms of key dependencies. Given the non-determinism of the hardware thread scheduler, the operations corresponding to different leakage Hamming distances may be executed at different times by each thread. The resulting power trace for the GPU contains multiple leakages at random time locations.

Assume there are M threads running and the power consumption is measured at N discrete times t_1, ..., t_N. For each thread i, there are H Hamming distance leakages: h_{i,1}, ..., h_{i,H}. We denote the time of the operation corresponding to h_{i,j} as t_{i,j}. Then the power consumption at time t in one power trace is:
$$ P(t)= a \sum\limits_{i = 1}^{M} \sum\limits_{j = 1}^{H} \mathbb{I}\{t=t_{i,j}\}h_{i,j}+R(t), $$
where \(\mathbb {I}\{t=t_{i,j}\}\) is the indicator function (i.e., when t = t_{i,j} it is 1; otherwise it is 0), a is the unit power consumption for a single bit switching, and R(t) is the noise at time t. The noise R(t) includes all other unrelated power consumption (e.g., operations by other threads executing during the same measurement period, and other unrelated concurrent operations in the same thread).

Since each thread’s power trace is misaligned with the other threads’ traces, t_{i,j} can be shifted randomly across threads. Given the parallel execution behavior of the GPU, it becomes very difficult to identify the exact value of t_{i,j}, since threads can be executing different instructions at any time.

To retain the information of h_{i,j} without knowledge of t_{i,j}, we propose to sum P(t) over time t, similar to the sliding-window DPA in [2]. Summing over the N discrete times t_1, ..., t_N of a power trace, the power model becomes the total (equivalently, the average) power consumption of each trace:
$$\begin{array}{@{}rcl@{}} P&=&a \sum\limits_{i = 1}^{M} \sum\limits_{j = 1}^{H} h_{i,j}+R \end{array} $$
$$\begin{array}{@{}rcl@{}} &=&aPE+R \end{array} $$
where R is the summation of noise over time t, and the power estimation is
$$ PE= \sum\limits_{i = 1}^{M} \sum\limits_{j = 1}^{H} h_{i,j}. $$

To launch a CPA attack on a GPU using the average (total) power consumption model described above, the power traces are processed accordingly, and the Pearson correlation coefficient is derived between the predicted and measured values. The effectiveness of a CPA attack on a highly parallel computing platform, however, must first be evaluated.

Our previous modeling work [6] showed that the success rate of CPA can be predicted by two factors: i) the physical side-channel signal-to-noise ratio and ii) algorithmic confusion coefficients (a metric defined to capture the key distinguishability due to the algorithm and the select function). The noise level tends to be higher in the CPA attack on a GPU since the summation over time includes much more noise due to irrelevant operations in multiple threads. The confusion coefficients also differ in the GPU, as we are targeting the sum of three Hamming distances. We next derive the confusion coefficients of AES on a GPU.

From Fig. 6 and Eqs. 1, 2, and 3, the select function relating to one byte of ciphertext c_1 in one thread is:
$$ \begin{array}{llll} h_{s}&=h_{1}+h_{2}+h_{3} \\ &= 3 \text{HW}(s_{out})+ 2\text{HW}(2\otimes s_{out})+ 2\text{HW}(3\otimes s_{out})\\ &\quad+\text{HW}(s_{in} \oplus s_{out}) \end{array} $$
where s_out = c_1 ⊕ k_1. Compared to CPA on non-parallel computing platforms, where the attack assumes HW(s_in ⊕ s_out), the h_s for a GPU contains the extra terms 3HW(s_out) + 2HW(2 ⊗ s_out) + 2HW(3 ⊗ s_out). Note that for a single-bit change in k_1, HW(s_out) only changes by one, and so do HW(2 ⊗ s_out) and HW(3 ⊗ s_out) in most cases (when the multiplication result does not overflow). These extra terms have a nearly linear relationship to the Hamming weight of the key. Hence, the distribution of confusion coefficients is more spread out on a GPU than on other computing platforms, leading to a less powerful CPA attack. The confusion coefficient [6] is defined as the variance of the difference between the power estimations calculated from the true key and a false key (i.e., a higher confusion coefficient means a larger distance between the true key and the false key, making them easier to distinguish).
$$ \kappa_{i} = E[(PE_{i} - PE)^{2}] $$
where PE_i is the power estimation of a false key k_i, PE is the power estimation of the correct key, and κ_i is the confusion coefficient of k_i.
Figure 7 shows the distribution of the normalized confusion coefficients calculated for one subkey byte of the last round key for the GPU. Because of the linearity, the distribution is widely spread. The false keys possessing near-zero confusion coefficients can become ghost keys, which are hard to distinguish from the true key at high noise levels. Figure 8 shows the distribution of the normalized confusion coefficients for a select function that excludes the linear terms and keeps only the non-linear term HW(s_in ⊕ s_out), which is the normal situation for an FPGA or a CPU. These confusion coefficients are concentrated at larger values, and therefore the true key is easier to identify.
Fig. 7

Distribution of confusion coefficient for one byte of the key for the GPU

Fig. 8

Distribution of the confusion coefficient without linearity
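The contrast between the two distributions can be explored with a small, self-contained Python experiment. This is our own sketch, not the authors' data: it uses a single-thread select function, uniform ciphertext bytes, an on-the-fly S-box, and an arbitrary true key of 0x00:

```python
def xtime(b):
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def build_sbox():
    sbox = [0] * 256
    for x in range(256):
        inv = next(y for y in range(256) if gmul(x, y) == 1) if x else 0
        s = inv
        for _ in range(4):
            inv = ((inv << 1) | (inv >> 7)) & 0xFF
            s ^= inv
        sbox[x] = s ^ 0x63
    return sbox

SBOX = build_sbox()
INV_SBOX = {s: x for x, s in enumerate(SBOX)}

def hw(x):
    return bin(x).count("1")

def hs_gpu(c, k):
    """GPU select function h_s = h1 + h2 + h3 (linear terms included)."""
    s_out = c ^ k
    return (3 * hw(s_out) + 2 * hw(gmul(2, s_out))
            + 2 * hw(gmul(3, s_out)) + hw(INV_SBOX[s_out] ^ s_out))

def hs_nonlinear(c, k):
    """Classic non-linear select function HW(s_in ^ s_out)."""
    s_out = c ^ k
    return hw(INV_SBOX[s_out] ^ s_out)

def normalized_confusion(select, true_k=0x00):
    """kappa_i = E_c[(PE_i - PE)^2] over all 256 ciphertext bytes, for each
    false key, normalized by the mean over false keys."""
    pe = [select(c, true_k) for c in range(256)]
    raw = []
    for ki in range(256):
        if ki == true_k:
            continue
        pe_i = [select(c, ki) for c in range(256)]
        raw.append(sum((a - b) ** 2 for a, b in zip(pe_i, pe)) / 256)
    mean = sum(raw) / len(raw)
    return [v / mean for v in raw]

kappa_gpu = normalized_confusion(hs_gpu)
kappa_nl = normalized_confusion(hs_nonlinear)
```

Comparing min(kappa_gpu) against min(kappa_nl) shows the effect described above: the GPU select function's linear terms pull some false-key coefficients toward zero, creating potential ghost keys, while the purely non-linear select function keeps all coefficients at larger values.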

5 Key Discovery of GPU by Power Analysis Attacks

We next discuss our power analysis of the targeted GPU-based AES implementation. We first extract the full AES key under the chosen-plaintext attack model, where the adversary can choose the size and content of the plaintext message. In this attack, the signal strength is boosted by redundant AES encryption instances of the same plaintext message, making full use of the GPU's hardware resources to overcome the high noise and the linearity in the confusion coefficients. Then, we extend our analysis to the known-ciphertext attack model, where the attacker has no control over the plaintext input, and predict the number of traces needed for specific attack success rates.

5.1 Full Key Extraction

To take advantage of the parallel computing structure of the GPU, we let each AES block encryption use L = 4 threads. We first set the message size to 8 blocks, requiring 32 concurrent threads, which we call an AES encryption instance. We consider the attack in a multi-threaded environment by executing multiple concurrent AES instances. Since the power consumption of one AES encryption is very small, we first use 768 instances of AES processing the same message to increase the power consumption for the measurement. That is, we use 768 × 4 × 8 = 24,576 threads for each power trace. Across different power traces, the plaintext messages vary and are generated independently of each other.

We first test the viability and correctness of our power model. We increase the number of power traces from 1000 to 100,000, in steps of 100 traces. For each selected number of traces, the power estimates for one selected subkey byte (the first byte of the last round key) are computed for all 256 possible values. The Pearson correlation coefficient for each subkey byte guess is computed between the power estimate according to Eq. 7 and the average (sum) of the power points in each power trace, and is plotted in Fig. 9.
Fig. 9

Correlation between the power traces and the Hamming distances for all possible subkey byte values

As shown in Fig. 9, after 40,000 traces, the correct subkey clearly stands out, producing the largest (negative) correlation coefficient. The correlation coefficient is negative because we use the measured voltage to represent power consumption: a lower voltage corresponds to higher power consumption. In Section 5.2, we build a statistical model to estimate the success rate of recovering the correct subkey for a given number of traces.
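The CPA step itself can be sketched in a few lines. The simulation below is illustrative only: it uses a simplified single-byte Hamming-weight model HW(c ⊕ k) in place of the paper's three-Hamming-distance model of Eq. 7, a hypothetical subkey value 0x5C, and synthetic traces with a negative coefficient to mimic the voltage measurement:

```python
import random

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)
TRUE_KEY = 0x5C                                    # hypothetical subkey byte
HW = [bin(v).count("1") for v in range(256)]
cts = [random.randrange(256) for _ in range(10000)]
# coefficient a < 0: lower measured voltage corresponds to higher power consumption
power = [-0.5 * HW[c ^ TRUE_KEY] + random.gauss(0, 1.0) for c in cts]

# correlate the measured traces against the model for every key guess
corr = {k: pearson([HW[c ^ k] for c in cts], power) for k in range(256)}
best = min(corr, key=corr.get)                     # largest negative correlation wins
```

With enough traces, the guess with the largest negative correlation is the true subkey, while ghost keys at small Hamming distance from it trail closely behind, as observed in Fig. 9.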

We also observe that, although the correlation coefficient of the correct subkey stands out, some false subkey values result in very close correlation coefficients, no matter how many traces we collect and analyze. This is due to the fact that the power model (7) depends heavily (and linearly) on the key value, as discussed in the end of Section 4.2. Therefore, the false values for subkeys possessing small Hamming distances from the true value are not as easily distinguishable in a GPU setting, especially when compared to the same values obtained on a CPU or FPGA computing platform.

Next, we run CPA on the power traces, extracting the last round key byte by byte. Figure 10 shows the attack results for 100,000 traces. We label the true subkey byte values with '*' and the candidates with the lowest correlation with '∘'. In the figure, all the correct subkey byte values have the lowest correlation coefficients, i.e., the attacker can recover the exact last round key with control of the plaintext.
Fig. 10

Our CPA attack results

The select function for our attack (shown in Eq. 8) includes all three key-dependent Hamming distances in the last round. In general, a non-linear select function results in higher and more concentrated confusion coefficients, which leads to a higher success rate under the same SNR (according to the success rate prediction model given in [6]). With three Hamming distances, adding them up reduces the noise and increases the SNR, and also results in different confusion coefficients compared to using only one Hamming distance. Of the three Hamming distances h1, h2 and h3, h1 is nonlinear, while h2 and h3 are mostly linear in the subkey byte's value. We generate seven different select functions based on the different combinations of the three Hamming distances, calculate their corresponding confusion coefficients, derive SNRs from the measured traces, and plug them into the success rate prediction formula (detailed in the next section).

Figure 11a shows that among the three Hamming distances, h1 is the best select function, owing to its non-linear nature and its signal level (slightly higher than that of h2); h3 is the worst, since it only provides a linear component. In addition, in Fig. 11b, comparing the three groups of curves (all three Hamming distances included, two included, and only the strongest, h1, included), we see that including more leakage Hamming distances always produces better results. For this reason, in the subsequent analysis, we always use all three Hamming distances (h1, h2 and h3) for effective attacks.
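The benefit of summing multiple leakage terms can be seen in a small simulation. In the sketch below, three i.i.d. Gaussian components stand in for the per-trace leakages h1, h2 and h3 (an idealization that ignores their different confusion coefficients); the simulated trace leaks all three plus noise, while the select function models only the first m of them:

```python
import random

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
n = 20000
# i.i.d. Gaussian stand-ins for the three per-trace Hamming-distance leakages
h = [[random.gauss(0, 1) for _ in range(n)] for _ in range(3)]
# the measured trace always leaks all three components, plus noise
trace = [h[0][t] + h[1][t] + h[2][t] + random.gauss(0, 3.0) for t in range(n)]

corr = {}
for m in (1, 2, 3):
    # the select function models only the first m leakage components
    model = [sum(h[i][t] for i in range(m)) for t in range(n)]
    corr[m] = pearson(model, trace)
# correlation grows as sqrt(m / 12) here, so each added leakage term raises the SNR
```

Under these assumptions the correlation grows with the square root of the number of modeled components, consistent with the observation that including more Hamming distances always improves the attack.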
Fig. 11

Success rate with different combinations of linear and nonlinear Hamming distances

5.2 A More Realistic Execution Environment

GPUs are powerful platforms, able to run thousands of concurrent threads. GPUs can be used effectively to accelerate AES encryption/decryption. In an actual AES implementation on a GPU, a large number of threads would be encrypting/decrypting different plaintext/ciphertext values concurrently. A highly-tuned implementation of AES would try to utilize the full capacity of the GPU.

Given this more realistic scenario, the attacker does not have control over the plaintext. We would like to understand how successful the attack will be as a function of the number of power traces collected. To extend our attack model to this more realistic execution environment, we build on our previous work presented above [19] and leverage the success rate model for CPA proposed by Fei et al. [6] to predict the number of traces needed to launch a successful attack in this situation. Mangard [22] also proposed a model to estimate the trace count, but it fails to consider the effects of false key candidates (the confusion coefficients), which have a major impact in our attack.

Assume we have a B-block plaintext message, where each block requires L (L = 4 in our implementation) threads to encrypt. If BL is smaller than the maximum thread capacity M of the GPU, for simplicity, we assume the other M − BL threads are idle and do not contribute noise to our power measurement. Later, we will add the effect of idle threads into our analysis. We also assume that the noise generated by the BL threads is i.i.d. Gaussian with zero mean. Then the standard deviation of the noise term in Eq. 6 can be expressed as
$$ {\sigma_{N}^{B}} = \sqrt{B}{\sigma_{N}^{1}} $$
where \( {\sigma_{N}^{B}} \) is the noise standard deviation of the power trace from encrypting a B-block plaintext message, and \( {\sigma_{N}^{1}} \) is for B = 1.
For the true key value, the power estimation becomes:
$$ PE^{B}= \sum\limits_{i = 1}^{B} \sum\limits_{j = 1}^{H} h_{i,j} $$
where h_{i,j} is the j-th Hamming distance of one AES instance in the i-th message block (with the correct key byte), and there are H Hamming distances for each data block.
For the l-th false key, its power estimation can be expressed as:
$$ P{E_{l}^{B}}= \sum\limits_{i = 1}^{B} \sum\limits_{j = 1}^{H} h_{i,j,l} $$
where h_{i,j,l} is the j-th Hamming distance of one AES instance in the i-th block, for the l-th false key.
In previous work [6], Fei et al. described a model to quantitatively calculate the theoretical success rate for one key byte as
$$ SR = {\Phi}_{N_{k}-1}\{\sqrt{n}\frac{|a|}{2\sigma_{N}}\mathbf{K}^{-1/2}\boldsymbol{\kappa}\} $$
where \( {\Phi}_{N_{k}-1} \) is the cumulative distribution function of an (N_k − 1)-dimensional standard normal distribution, N_k is the number of key candidates, n is the number of traces, a is the coefficient in Eq. 6, and σ_N is the standard deviation of the noise; κ is the vector holding the confusion coefficients.
For our highly parallel GPU environment, the confusion coefficient for the l-th wrong key is:
$$ {\kappa_{l}^{B}} = E[|PE^{B}-P{E^{B}_{l}}|^{2}] $$
The (l, m)-th element of the three-way confusion coefficient matrix K is
$$ K^{B}_{l,m}=E[(PE^{B}-P{E^{B}_{l}})(PE^{B}-P{E^{B}_{m}})] $$
We add the superscript B to emphasize that we are dealing with B blocks of plaintext encryption on a GPU.
We denote the difference between power estimations of the true key and a false key as
$$ {Q^{B}_{l}} = PE^{B} - P{E^{B}_{l}} = \sum\limits_{i = 1}^{B} \sum\limits_{j = 1}^{H}(h_{i,j} - h_{i,j,l})= \sum\limits_{i = 1}^{B} q_{i,l} $$
where \(q_{i,l}={\sum }_{j = 1}^{H}(h_{i,j} - h_{i,j,l})\) is the difference of the Hamming distances of the i-th block between the correct key value and the l-th false key. Due to the diffusion property of AES, the mean of q_{i,l} is 0, and q_{i,l} and q_{j,l} are independent if i ≠ j, so we have
$$ {\kappa^{B}_{l}}=E[|PE^{B}-P{E^{B}_{l}}|^{2}]=E[|{Q^{B}_{l}}|^{2}] = B \sigma_{q,l}^{2}, $$
with \(\sigma_{q,l}^{2}\) denoting the variance of q_{i,l}. The subscript i is dropped because the variance does not depend on i.
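The scaling \(\kappa^{B}_{l} = B \sigma_{q,l}^{2}\) is easy to verify by simulation. In the sketch below, the per-block differences q_{i,l} are drawn i.i.d. from a zero-mean Gaussian with an assumed standard deviation; this is a stand-in for the actual Hamming-distance differences, whose exact distribution depends on the select function:

```python
import random

random.seed(7)
SIGMA_Q = 1.3        # assumed standard deviation of the per-block difference q_{i,l}

def empirical_kappa(B, trials=20000):
    # empirical E[|Q_l^B|^2], with Q_l^B the sum of B i.i.d. zero-mean q_{i,l}
    total = 0.0
    for _ in range(trials):
        q_sum = sum(random.gauss(0, SIGMA_Q) for _ in range(B))
        total += q_sum * q_sum
    return total / trials

# empirical_kappa(B) converges to B * SIGMA_Q**2 for any block count B
```

Because the q_{i,l} are independent across blocks, the second moment of their sum is just B times the per-block variance, which is exactly the statement of the equation above.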
For the three-way confusion coefficient matrix \(K^{B}\), we denote
$$ R_{l,m} = E[q_{i,l}q_{i,m}] $$
Then we have
$$ \begin{array}{lllllll} K^{B}_{l,m}&=E[(PE^{B}-P{E^{B}_{l}})(PE^{B}-P{E^{B}_{m}})]\\ &=E[{Q^{B}_{l}}{Q^{B}_{m}}]=E[{\sum}_{i = 1}^{B} q_{i,l} {\sum}_{j = 1}^{B} q_{j,m}]=B R_{l,m} \end{array} $$
After plugging Eqs. 10, 19, and 17 into Eq. 13, we get
$$ \begin{array}{lllllllllll} SR^{B} &= {\Phi}_{N_{k}-1}\{\sqrt{n}\frac{|a|}{2{\sigma^{B}_{N}}}(\mathbf{K}^{B})^{-1/2}\boldsymbol{\kappa}^{B}\} \\ &= {\Phi}_{N_{k}-1}\{\sqrt{n}\frac{|a|}{2{\sigma^{1}_{N}}}(\mathbf{K}^{1})^{-1/2}\boldsymbol{\kappa}^{1}\} \end{array} $$
This shows that with the same number of traces, the success rate is independent of the block size B.
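The B-independence of Eq. 20 can be checked directly in the simplest non-trivial case of N_k = 2 key candidates, where K reduces to the scalar κ and \(\mathbf{K}^{-1/2}\boldsymbol{\kappa} = \sqrt{\kappa}\). The sketch below (all parameter values are illustrative assumptions) evaluates the resulting one-dimensional formula and confirms that scaling κ by B and σ_N by √B leaves the success rate unchanged:

```python
import math

def phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def success_rate(n, a, sigma_n, kappa):
    # two-candidate case of Eq. 13: SR = Phi(sqrt(n) * |a| / (2 sigma_N) * sqrt(kappa))
    return phi(math.sqrt(n) * abs(a) / (2.0 * sigma_n) * math.sqrt(kappa))

sr_1 = success_rate(n=1000, a=-0.5, sigma_n=8.0, kappa=0.9)        # single block
B = 16
# B blocks: kappa scales by B, sigma_N scales by sqrt(B), so the argument is unchanged
sr_B = success_rate(n=1000, a=-0.5, sigma_n=8.0 * math.sqrt(B), kappa=0.9 * B)
```

The √B growth of the signal term is exactly cancelled by the √B growth of the noise standard deviation, which is the content of Eq. 20.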
To produce the values of a and \( {\sigma_{N}^{1}} \), and verify the correctness of Eqs. 13 and 20, we performed attacks with three different plaintext message sizes, with B equal to 8, 16 and 32. Just as in the experiments in Section 5.1, we utilize M = 24,576 threads to increase the power consumption for better measurement resolution. However, by doing this, each AES instance will be replicated M/BL times. As a result, we have:
$$\begin{array}{@{}rcl@{}} & &\sigma_{N}^{'B} = \sqrt{\frac{M}{L}}{\sigma_{N}^{1}} \end{array} $$
$$\begin{array}{@{}rcl@{}} &&\kappa^{\prime B}_{l}=(\frac{M}{BL})^{2} B \sigma_{q,l}^{2} = (\frac{M^{2}}{BL^{2}}) \sigma_{q,l}^{2} \end{array} $$
$$\begin{array}{@{}rcl@{}} &&K^{\prime B}_{l,m} =(\frac{M}{BL})^{2} B R_{l,m} = (\frac{M^{2}}{BL^{2}}) R_{l,m} \end{array} $$
$$ SR^{\prime B} = {\Phi}_{N_{k}-1}\{\sqrt{n}\frac{|a|}{2{\sigma^{1}_{N}}} \sqrt{\frac{M}{BL}} (\mathbf{K}^{1})^{-1/2}\boldsymbol{\kappa}^{1}\} $$
This suggests that when M = BL, there are no repeated AES encryption instances and SR′^B = SR^B, as expected. When M > BL, we have a more typical scenario of how a GPU would leverage many threads for acceleration (while many threads are used, the GPU is not run at full capacity): only BL threads are performing computation and the remaining M − BL threads are idle. Formula (20) gives the success rate for this attack. Note that formula (24) is for a reference attack, in which there are \(\frac {M}{BL}\) instances of the same computation, i.e., a much higher side-channel SNR. The reason we examine the reference attack, and run experiments with repeated computations, is to generate power measurements with sufficient resolution. We can then extrapolate the results of the reference attack and predict the number of traces needed for the real attack, which requires \(\frac {M}{BL}\) times more power traces.
Figure 12 shows both the empirical success rates for the reference attacks and the theoretical calculations generated by Eq. 24, for M = 24,576 and B equal to 8, 16, or 32. For each value of B, the two curves track each other very well.
Fig. 12

Empirical and theoretical success rates for 8, 16 and 32 blocks of plaintext

Taking B = 8 in Fig. 12 as an example, for a realistic attack with no repeated AES encryption instances, obtaining the same success rate requires 768 (M/BL = 24,576/(4 × 8)) times the number of traces of the reference attack. For example, to achieve a 70% success rate, the reference attack needs approximately 33,000 traces, while a realistic attack would need about 25.3 million (768 × 33,000). Furthermore, when M > BL, the unutilized threads may be used for other computation, which increases the noise level without contributing any side-channel signal, so even more traces may be needed to reach the same success rate.
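The extrapolation from the reference attack to the realistic one is a single multiplication. The sketch below computes the replication factor and the realistic trace count for the B = 8 configuration (the 33,000-trace figure for a 70% success rate is taken from the reference-attack measurements):

```python
# Reference-attack traces use M/(B*L) replicated AES instances per measurement,
# so a realistic attack with no replication needs that many times more traces
# for the same success rate.
M, B, L = 24576, 8, 4
replication = M // (B * L)          # replicated instances per reference trace
n_reference = 33_000                # traces for ~70% success rate in the reference attack
n_realistic = replication * n_reference
```

Any other (B, L) configuration extrapolates the same way, with the factor M/(BL) shrinking as the GPU is driven closer to full independent-message capacity.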

6 Countermeasures

Countermeasures should be applied to defend AES on a GPU against side-channel power analysis. The common countermeasure of masking would also work on a GPU: since the intermediate values are randomized by the mask, the attacker can no longer correlate the power with the model. However, another typical countermeasure, random delay of instructions, is not effective on a GPU. One reason is that the instructions issued by different warps are already randomized to some degree by the scheduler. Another is that the attacker can always average the entire trace to capture the leakage, as is done in this paper.

With sufficient key storage and adequate key management, users should avoid using the same secret key for all encryptions on the GPU. Even without a different key for each data block, using a few keys on the GPU would significantly increase the attack complexity and could render the attack infeasible. Another effective countermeasure is to initialize registers with random values before writing to them. This introduces some performance degradation, but the random initialization can be limited to the sensitive registers (e.g., the last-round registers) to minimize the effect. Given the high performance of the GPU, such a small overhead would be negligible.

7 Conclusion

In this paper, we present a side-channel power analysis of a GPU AES implementation. We describe a process for obtaining power consumption measurements on an NVIDIA GPU and highlight the various challenges of power analysis on a GPU. To overcome these difficulties, we propose effective strategies for processing the power traces for a successful correlation power analysis. The corresponding power model is built from the CUDA PTX assembly code. We begin our analysis of the attack assuming control over the plaintext, and analyze its scalability as we increase the size of the plaintext. We find a linear relationship between the amount of plaintext and the number of traces needed, though the computational complexity grows exponentially. The attack results show that a GPU, a highly popular but very complex parallel computing device, is vulnerable to side-channel power analysis attacks.


  1. Brier E, Clavier C, Olivier F (2004) Correlation power analysis with a leakage model. In: Cryptographic hardware & embedded systems, vol 3156, pp 16–29
  2. Clavier C, Coron JS, Dabbous N (2000) Differential power analysis in the presence of hardware countermeasures. Springer, Berlin, pp 252–263
  3. Cook D, Keromytis AD (2006) Cryptographics: exploiting graphics cards for security, vol 20. Springer Science & Business Media
  4. Cook DL, Ioannidis J, Keromytis AD, Luck J (2005) Cryptographics: secret key cryptography using graphics cards. In: Topics in cryptology–CT-RSA 2005. Springer, pp 334–350
  5. Daemen J, Rijmen V (1998) AES proposal: Rijndael
  6. Fei Y, Ding AA, Lao J, Zhang L (2015) A statistics-based success rate model for DPA and CPA. J Cryptogr Eng 5(4):227–243
  7. Gaster B, Howes L, Kaeli DR, Mistry P, Schaa D (2013) Heterogeneous computing with OpenCL: revised OpenCL 1.2 edition, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco
  8. Genkin D, Shamir A, Tromer E (2014) RSA key extraction via low-bandwidth acoustic cryptanalysis. In: Advances in cryptology–CRYPTO 2014. Springer, pp 444–461
  9. Gierlichs B, Batina L, Tuyls P, Preneel B (2008) Mutual information analysis. In: Cryptographic hardware & embedded systems, pp 426–442
  10. Gilger J, Barnickel J, Meyer U (2012) GPU-acceleration of block ciphers in the OpenSSL cryptographic library. In: Information security. Springer, pp 338–353
  11. Hwu WM (2011) GPU computing gems emerald edition, 1st edn. Morgan Kaufmann Publishers Inc., San Francisco
  12. Iwai K, Kurokawa T, Nisikawa N (2010) AES encryption implementation on CUDA GPU and its analysis. In: 2010 First international conference on networking and computing, pp 209–214
  13. Jan MR, Anantha C, Borivoje N (2003) Digital integrated circuits: a design perspective
  14. Jiang ZH, Fei Y, Kaeli D (2016) A complete key recovery timing attack on a GPU. In: 2016 IEEE international symposium on high performance computer architecture (HPCA), pp 394–405
  15. Kocher P, Jaffe J, Jun B, Rohatgi P (2011) Introduction to differential power analysis. J Cryptogr Eng 1(1):5–27
  16. Leischner N, Osipov V, Sanders P (2009) Nvidia Fermi architecture white paper
  17. Li Q, Zhong C, Zhao K, Mei X, Chu X (2012) Implementation and analysis of AES encryption on GPU. In: 2012 IEEE 14th international conference on high performance computing and communication, 2012 IEEE 9th international conference on embedded software and systems, pp 843–848
  18. Lombardi F, Di Pietro R (2014) Towards a GPU cloud: benefits and security issues. In: Continued rise of the cloud. Springer, pp 3–22
  19. Luo C, Fei Y, Luo P, Mukherjee S, Kaeli D (2015) Side-channel power analysis of a GPU AES implementation. In: IEEE Int. Conf. on computer design (ICCD). IEEE, pp 281–288
  20. Luo P, Fei Y, Fang X, Ding AA, Leeser M, Kaeli DR (2014) Power analysis attack on hardware implementation of MAC-Keccak on FPGAs. In: Int. Conf. on ReConFigurable computing and FPGAs (ReConFig), pp 1–7
  21. Manavski S (2007) CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In: IEEE Int. Conf. on signal processing & communications, pp 65–68
  22. Mangard S (2004) Hardware countermeasures against DPA – a statistical analysis of their effectiveness. Springer, Berlin, pp 222–235
  23. Margara P (2015) Engine-CUDA, a cryptographic engine for CUDA supported devices
  24. Maurice C, Neumann C, Heen O, Francillon A (2014) Confidentiality issues on a GPU in a virtualized environment. In: Financial cryptography and data security. Springer, pp 119–135
  25. Messerges TS, Dabbish EA, Sloan RH (1999) Power analysis attacks of modular exponentiation in smartcards. In: Cryptographic hardware & embedded systems, pp 144–157
  26. Moradi A, Hinterwälder G (2015) Side-channel security analysis of ultra-low-power FRAM-based MCUs. In: Proc. Int. WkShp on constructive side-channel analysis & secure design
  27. NVIDIA (2015) CUDA C programming guide
  28. Ors SB, Gurkaynak F, Oswald E, Preneel B (2004) Power-analysis attack on an ASIC AES implementation. In: Int. conf. on info. tech.: coding & computing, vol 2, pp 546–552
  29. Örs SB, Oswald E, Preneel B (2003) Power-analysis attacks on an FPGA – first experimental results. In: Cryptographic hardware & embedded systems, pp 35–50
  30. Pietro RD, Lombardi F, Villani A (2016) CUDA leaks: a detailed hack for CUDA and a (partial) fix. ACM Trans Embedded Comput Syst (TECS) 15(1):15
  31. Szerwinski R, Güneysu T (2008) Exploiting the power of GPUs for asymmetric cryptography. In: Cryptographic hardware and embedded systems. Springer, pp 79–99

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Northeastern University, Boston, USA
  2. Department of Mathematics, Northeastern University, Boston, USA
