Power Analysis Attack of an AES GPU Implementation
Abstract
In the past, Graphics Processing Units (GPUs) were mainly used for graphics rendering. Over the past 10 years, they have been redesigned and are now used to accelerate a wide range of applications, including deep neural networks, image reconstruction and cryptographic algorithms. Despite being the accelerator of choice in a number of important application domains, today’s GPUs have received little attention with respect to their security, especially their vulnerability to realistic and practical threats such as side-channel attacks. In this work, we present our study of side-channel vulnerability targeting a general-purpose GPU. We propose and implement a side-channel power analysis methodology to extract all the last round key bytes of an AES (Advanced Encryption Standard) implementation on an NVIDIA Tesla GPU. We first analyze the challenges of capturing GPU power traces that arise from the degree of concurrency and the underlying architectural features of a GPU, and propose techniques to overcome these challenges. We then construct an appropriate power model for the GPU. We describe effective methods to process the GPU power traces and launch a correlation power analysis (CPA) attack on the processed data. We carefully consider the scalability of the attack with increasing degrees of parallelism, a key challenge on the GPU. Both our empirical and theoretical results show that parallel computing hardware systems such as GPUs are vulnerable to power analysis side-channel attacks and need to be hardened against such threats.
Keywords
Side-channel attack, Correlation power analysis, AES, GPGPU
1 Introduction
Graphics Processing Units (GPUs), originally designed for 3D graphics rendering, have evolved into high-performance general-purpose processors. Today, a GPU can provide significant performance advantages over traditional multicore CPUs by executing workloads in parallel on hundreds to thousands of cores. This development was spurred by the delivery of programmable shader cores and high-level programming languages [7], including CUDA and OpenCL. Since then, GPUs have been used to accelerate a wide range of applications [11], including signal processing, circuit simulation, molecular modeling and machine learning.
Motivated by the demands of efficient cryptographic computation over large amounts of data, GPUs are now being leveraged to accelerate a number of cryptographic algorithms. Before the introduction of CUDA and OpenCL, Cook et al. [3, 4] made the first efforts to map an AES cipher to a fixed graphics pipeline using OpenGL. Using CUDA, Manavski [21] implemented AES on an NVIDIA G80 GPU, achieving a speedup as high as 5.9 times compared to the fastest CPU at the time. Iwai et al. achieved a throughput of approximately 35 Gbps (gigabits per second) on an NVIDIA GeForce GTX285 [12]. Li et al. [17] achieved the highest performance, around 60 Gbps throughput on an NVIDIA Tesla C2050 GPU, which runs up to 50 times faster than an Intel Core i7-920. More recent work accelerated asymmetric ciphers by exploiting the power of GPUs [31]. Gilger et al. [10] implemented multiple block ciphers in both CUDA and OpenCL. This provided an OpenSSL cryptographic engine solution that could easily accelerate common ciphers and thus reduce development effort.
While the focus of prior work has been on accelerating cryptographic implementations by leveraging a GPU’s computation power, there is little prior work that addresses the security of execution on a GPU. Di Pietro et al. [30] demonstrated that leakage of information can occur in a GPU’s shared memory, global memory and registers by using standard CUDA instructions. Maurice et al. [24] recovered data of a previously executed GPU application in a virtualized environment. Lombardi et al. [18] described how a GPU-as-a-Service in the cloud can be misused, leading to denial-of-service attacks and information leakage. However, side-channel vulnerabilities of GPUs have received limited attention in the research community. Meanwhile, cryptographic systems based on other platforms, including microcontrollers [26], smart cards [25], application-specific integrated circuits (ASICs) [28] and FPGA platforms [20, 29], have all been shown to be highly vulnerable to side-channel attacks. We are the first to conduct research on side-channel analysis of GPUs. In our prior work [19], we presented the first power analysis of AES on a GPU, demonstrating the feasibility of an attack. Our group also launched the first timing attack on AES on a GPU [14].
Distinct from other computational platforms, the Single Instruction Multiple Thread (SIMT) model used on a GPU presents a range of challenges to side-channel analysis. During execution, each thread can be in a different phase of execution, generating some degree of randomness (i.e., timing uncertainties and misalignment of power traces). In addition, the complexity of the GPU hardware system makes it rather difficult to obtain clean and synchronized power traces. Moreover, the power consumption model is very complicated. To address these challenges, we develop effective methods to obtain clean power traces, and build a suitable side-channel power leakage model to guide a successful power analysis attack. Our correlation power analysis (CPA) attack [19] demonstrates that AES-128 developed in CUDA on an NVIDIA C2070 GPU is susceptible to power analysis attacks.
Our prior successful power analysis attack on a GPU [19] was implemented in a highly controlled environment, where many threads are employed for computation but repeatedly work on the same blocks of plaintext data. Our results [19] mark an important step forward, demonstrating the feasibility of key recovery on a GPU. However, we acknowledge that the controlled attack environment in our prior work helped to limit the random noise and amplify the side-channel power signal, thereby increasing the signal-to-noise ratio (SNR) and making the power analysis attack effective.
In this work, we investigate the robustness of our side-channel power attack by utilizing different numbers of blocks of plaintext and increasing the degree of concurrency, further demonstrating the vulnerability of GPUs used for cryptographic computation in a more realistic setting. We extend our prior work significantly in both theory and experiments. We analyze multiple select functions, identify the best one for the attack, and provide quantitative analysis of why the chosen select function is successful. We analyze the scalability of the attack and explore how the size of the data, which translates to executing many parallel threads, impacts the attack success rate. We present evaluation results for attacks in much more realistic GPU execution scenarios, clearly demonstrating that the GPU incurs significant side-channel power leakage.

We present a detailed analysis of power leakage and construct our power model: the power leakage is decomposed into three parts, two of which are nearly linear in the key byte and the third of which is nonlinear. We then analyze the attack effectiveness (success rate) of CPA under different power models.

We revisit the success rate model proposed by Fei et al. [6] for CPA attacks on CPUs, and extend it to a parallel computing environment on a GPU, producing accurate predictions of the number of power traces needed to achieve a desired attack success rate.

We launch a large number of attacks while varying the degree of parallelism, examining the scalability of the attack. Both empirical and theoretical success rates are presented and analyzed. We examine how the degree of parallelism affects the effectiveness of our attack, and evaluate the success rate when the full capacity of the GPU is exploited in an AES implementation.
The rest of this paper is organized as follows. Section 2 provides background on CUDA and GPU architecture, AES ciphers and side-channel attacks, and introduces the attack model. In Section 3, we describe our experimental setup for acquiring power traces. We build the GPU’s power leakage model in Section 4. In Section 5, we present the attack results of extracting the last round key, present our extended success rate model and employ it to quantify the effectiveness of the attack. In Section 6, we suggest countermeasures against such attacks. Finally, we conclude the paper in Section 7.
2 Preliminaries
In this section we begin by describing the GPU hardware architecture and associated software programming model. Then we review the specific AES cipher considered in this work and its implementation in CUDA on an NVIDIA GPU. We also review the basics of correlation power analysis attacks, followed by the attack model we use in this work.
2.1 GPU Basics
During execution, thread blocks will be dispatched to the SMs such that the threads in the same block can communicate through local shared memory. Within one thread block, 32 threads are grouped in a warp. A warp is the smallest schedulable program unit. Threads in one warp run in synchronization. Due to data dependencies and control hazards, if a warp in execution is stalled, the warp scheduler will dispatch another warp in order to make the best use of hardware resources.
In summary, the host code prepares the data and sets up the runtime environment for the kernel to run on the GPU. The programmer writes the device code, which explicitly divides the job into blocks and threads. The GPU scheduler decides when and where the blocks and threads will run in parallel, based on the available hardware resources, the data dependencies present in the code, and the presence of any memory conflicts. In a well-designed CUDA program, data dependencies and memory conflicts are minimized, and the GPU uses all of its available resources to maximize parallel kernel execution.
2.2 AES and a CUDA Implementation of AES
AES [5] is a block cipher algorithm announced as the encryption standard by the National Institute of Standards and Technology in 2001. It is a symmetric-key cipher operating on a fixed-size block of data. AES consists of a variable number of rounds, depending on the key length. For key sizes of 128, 192 and 256 bits, AES has 10, 12 and 14 rounds, respectively. For AES-128, one block of data is organized as a 4x4 array of bytes, termed the state. Each round is a sequence of four operations: SubByte, ShiftRow, MixColumn, and AddRoundKey, except for the initial and last rounds. The initial round has only an AddRoundKey, and the last round omits the MixColumn. All the round keys are derived from a single initial key by the key schedule.
Figure 3 shows the round operations for one column running as a single thread. The initial round is simply an XOR of the plaintext and the first round key. There are nine middle rounds for 128-bit AES encryption. Each thread takes one diagonal of the state as its round input, and maps each byte into a 4-byte word through a T-table lookup. These four 4-byte words are XORed together with the corresponding 4-byte round key, and the result is stored in a column of the output state. The last round has no MixColumn operation, so only one out of four bytes is kept after the T-table lookup, making it equivalent to an S-box lookup followed by ShiftRow. AddRoundKey is then performed on the four remaining bytes.
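The relationship between a T-table word and the S-box byte kept in the last round can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the byte order of the packed word (2·S, S, S, 3·S from most to least significant byte) follows a common software convention and may differ from the CUDA implementation studied here, and the names `gf_mul`, `build_sbox`, `T0` and `last_round_byte` are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):       # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        y = inv
        for shift in range(1, 5):      # XOR with four left rotations of inv
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()

# Assumed layout: each word packs (2*S, S, S, 3*S), most significant byte first.
T0 = [(gf_mul(2, s) << 24) | (s << 16) | (s << 8) | gf_mul(3, s) for s in SBOX]


def last_round_byte(x):
    """Keep only the unscaled S-box byte of the T-table word (last round)."""
    return (T0[x] >> 16) & 0xFF


if __name__ == "__main__":
    print(hex(T0[0]), hex(last_round_byte(0)))
```

Because the middle bytes of each word are the unscaled S-box output, masking one of them out reproduces the S-box lookup without a separate table.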
To begin an AES encryption, the plaintext is first copied into the GPU global memory. Each thread will load its own data into local memory, based on its block id and thread id. After encryption is complete, the ciphertext in local memory is copied back into global memory, and then copied into CPU memory. In ECB mode, the encryption of each block of data is independent, and thus can be parallelized as much as possible, depending only on the size of the data and the available GPU resources.
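The index arithmetic by which a thread locates its data can be sketched in Python as a stand-in for the kernel code. The sketch assumes one thread per state column, i.e., four threads per 16-byte AES block, consistent with the per-column description above; the function name `thread_to_aes_work` and the block size of 256 threads are hypothetical choices for illustration.

```python
THREADS_PER_AES_BLOCK = 4   # assumption: one thread per 4-byte state column
AES_BLOCK_BYTES = 16


def thread_to_aes_work(block_id, thread_id, threads_per_block):
    """Map a (block id, thread id) pair to the plaintext bytes it loads."""
    gid = block_id * threads_per_block + thread_id    # global thread index
    aes_block = gid // THREADS_PER_AES_BLOCK          # which 16-byte AES block
    column = gid % THREADS_PER_AES_BLOCK              # which state column
    offset = aes_block * AES_BLOCK_BYTES + 4 * column # byte offset in the buffer
    return aes_block, column, offset


if __name__ == "__main__":
    for tid in (0, 1, 4, 255):
        print(tid, thread_to_aes_work(block_id=1, thread_id=tid,
                                      threads_per_block=256))
```

Since ECB blocks are independent, consecutive groups of four threads can process consecutive AES blocks without any inter-thread communication beyond the group.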
2.3 Side-channel Attack and Typical Correlation Power Analysis
Side-channel attacks are a class of attacks based on information gained from the physical implementation of a cryptosystem. Side-channel information can include power consumption, electromagnetic emanation, timing information, and even sound [8]. Because the leaked information depends on the secret key, an attacker can utilize correlation to recover the key with a complexity lower than that of brute force. The attack can be as simple as Simple Power Analysis (SPA) [15], which uses only a single power trace and reads the key bits directly by inspecting the temporal power variation. It can also be as complicated as Mutual Information Analysis [9], which is based on information theory.
In this paper, we use the Correlation Power Analysis (CPA) method to extract keys of AES-128. It is based on the correlation between the observed power information generated by the hardware and the power estimation calculated from a power model (which is a function of the key). To calculate the correlation, the attacker runs the cipher multiple times with different input plaintexts, and each run generates a power trace. For a block cipher, the processing of each byte is independent of the others, and therefore the attack can be conducted in a divide-and-conquer manner, retrieving the subkey bytes one by one. The power model estimates the deterministic part of the power consumption, e.g., the Hamming distance model for CMOS technology [1], which computes the number of logic changes (i.e., 0-to-1 or 1-to-0) based on the known plaintext (or ciphertext) and a guessed subkey byte value. Equipped with a large enough number of power traces, we compute the Pearson correlation coefficient between the trace data and the computed data for each guessed subkey value. If the subkey guess is right, the calculated correlation tends to be higher than when the subkey guess is wrong. For each subkey byte, one out of 256 possible values will be identified. For AES-128 with 16 key bytes, the total number of key guesses is only 4096 (= 2^{8} × 16), far lower than the complexity of 2^{128} for a brute-force attack.
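The CPA procedure for a single subkey byte can be sketched on simulated data. The following Python example is a toy simulation, not the measurement setup of this paper: it assumes a single leakage value per encryption, equal to the Hamming distance between the last-round S-box input and output plus Gaussian noise, and all function names and parameters (`recover_key`, `noise_sigma`, the trace count) are hypothetical.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for _x, _y in enumerate(SBOX):
    INV_SBOX[_y] = _x

# Assumed leakage: Hamming distance between S-box input and output,
# indexed by the S-box output s_out = c XOR k.
LEAK = [bin(INV_SBOX[s] ^ s).count("1") for s in range(256)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def recover_key(true_key, n_traces, noise_sigma, seed):
    """Simulate noisy leakage for random ciphertext bytes and run CPA."""
    rng = random.Random(seed)
    cts = [rng.randrange(256) for _ in range(n_traces)]
    power = [LEAK[c ^ true_key] + rng.gauss(0, noise_sigma) for c in cts]
    scores = [pearson([LEAK[c ^ k] for c in cts], power) for k in range(256)]
    return max(range(256), key=lambda k: scores[k])


if __name__ == "__main__":
    print(hex(recover_key(true_key=0x3C, n_traces=1000, noise_sigma=1.0, seed=1)))
```

In this simulation the leakage is positively correlated with the model, so the guess with the largest coefficient is selected; with real measurements the sign depends on how power is represented by the measured quantity.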
2.4 Attack Model

We assume the attacker knows the encrypted ciphertext. For part of our analysis, we assume knowledge of the plaintext, where the size and value of the input message can be controlled.

The attacker can obtain the power consumption of the GPU for each encryption. Power traces can be obtained via measurement or on-chip power sensors, locally or remotely.
3 Experimental Setup and Power Trace Acquisition
Power trace acquisition on a GPU differs substantially from the approaches used on MCUs, FPGAs and ASICs [20, 26, 28], for a number of reasons. First, the power trace of a GPU contains much more noise. Our measurement point on the ATX power supply is far away from the GPU silicon die power supply. On the GPU card, there are many DC-DC units converting the 12 V supply into the various other voltages needed by the GPU, which introduces switching noise, so we need to filter out the desired power information by using large capacitors. The measured total power consumption of the GPU card also contains the power consumption of the cooling fan, off-chip memory, PCIe interface and many other auxiliary circuits. These unrelated sources of power consumption further contribute to the noise.
Second, since there is no GPIO (General-Purpose Input/Output) or dedicated pin on the GPU to provide a precise trigger signal indicating the start or end of the encryption, the oscilloscope takes the rising edge of the power trace as the trigger signal. Because a power trace can be very noisy and its rising and falling edges are not as clean and uniform as we desire, it is challenging to consistently identify the beginning of an encryption in a trace. Therefore, the traces for different encryptions are not synchronized.
The last and most important issue is that the parallel computing behavior of the GPU may cause timing uncertainty in the power traces. The GPU scheduler may switch one warp out and bring in another at any time, and this behavior is not under the programmer’s control. Moreover, there are multiple streaming multiprocessors, each performing encryption concurrently and independently. These facts all pose significant challenges for GPU side-channel power analysis.
4 Power Model Building
Next, we build the power leakage model of the GPU for CPA, which will provide us with the power estimation formula PE(k), where k is the key candidate. The correlation between PE(k) and the actual power traces is then used to find the secret key.
4.1 Hamming Distance Based Power Leakage Extraction
The principle of a side-channel power analysis attack is that the power consumption of a cryptosystem is determined by key-dependent internal state switching. The power consumption of a CMOS circuit consists of static and dynamic power [13]. The static power persists as long as the circuit is powered on, due to leakage through reverse-biased p-n junctions. The static power dissipation depends mainly on the temperature and working voltage, and less on the internal data. Hence the static power does not vary much and is treated as noise in the power model. The dynamic power is due to the switching of voltages at circuit gate outputs (intermediate states). One part of this power is for charging and discharging the parasitic capacitance. Another part of the dynamic power is consumed by the short circuit formed by the PMOS and NMOS transistors while the output voltage changes. In a simplified model, the magnitude of the dynamic power consumption is linear in the number of changing bits (i.e., the Hamming distance) of the intermediate state. Next we find the intermediate states of our AES GPU implementation that depend on the secret key and derive their Hamming distances.
Since the attacker knows the cipher byte c_{1}, he/she can calculate these Hamming distances from c_{1} and a guessed value of the subkey byte k_{1}, working in reverse order. First, s_{out} = k_{1} ⊕ c_{1} depends linearly on the subkey byte value. Then s_{in} can be recovered by looking up s_{out} in the inverse S-box table, which is nonlinear in the key. T_{a} is the 4-byte T-table value looked up by s_{in}; it consists of four components, 2 ⊗ s_{out}, 3 ⊗ s_{out}, s_{out} and s_{out}, whose order depends on the byte position of s_{in}. All three Hamming distances depend on the guessed key value.
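The reverse computation of the intermediate values from a cipher byte and a key guess can be sketched as follows. This Python example only derives s_{out}, s_{in} and the four components of T_{a}; it does not reproduce the Hamming distance equations themselves, and the component order shown is an arbitrary assumption since the actual order depends on the byte position. The function name `last_round_intermediates` is hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for _x, _y in enumerate(SBOX):
    INV_SBOX[_y] = _x


def last_round_intermediates(c1, k1_guess):
    """Return (s_in, s_out, T_a components) for a cipher byte and key guess."""
    s_out = c1 ^ k1_guess                   # undo AddRoundKey: linear in the key
    s_in = INV_SBOX[s_out]                  # undo the S-box: nonlinear in the key
    t_a = (gf_mul(2, s_out), gf_mul(3, s_out), s_out, s_out)  # assumed order
    return s_in, s_out, t_a


if __name__ == "__main__":
    print(last_round_intermediates(c1=0x63, k1_guess=0x00))
```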
4.2 GPU’s Power Leakage Model
In previous work targeting CPA on FPGAs and CPUs, the attacker usually uses a power model based on nonlinear Hamming distances, such as h_{s} = HW(s_{in} ⊕ s_{out}). Adopting such a power model is very effective when the power traces can be aligned and the h_{s}-specific operation occurs at a fixed time t in all the power traces. The attacker then only needs to analyze the power values at one time point. However, on our parallel computing GPU platform, there are many concurrent threads running. Of the three Hamming distances in Eqs. 1, 2 and 3 for each thread, the first is nonlinear, while the other two are linear in terms of key dependencies. Given the nondeterminism of the hardware thread scheduler, the operations corresponding to different leakage Hamming distances may be executed at different times by each thread. The resulting power trace for the GPU contains multiple leakages at random time locations.
Since each thread’s power trace is misaligned with other threads’ traces, t_{i,j} can be shifted randomly for different threads. Given the parallel computing behavior of the GPU, it becomes very difficult to identify the exact value of t_{i,j}, since threads can be executing different instructions at any time.
To launch a CPA attack on a GPU using the average (total) power consumption model described above, the power traces will be processed accordingly, and the Pearson correlation coefficient is derived between the predicted and measured values. The effectiveness of a CPA attack on a highly parallel computing platform has to be evaluated first.
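The motivation for averaging over the whole trace, rather than sampling a fixed time point, can be illustrated with a toy simulation. The Python sketch below is not the model of this paper: it assumes a single leakage event per trace (a Hamming weight of a random byte) placed at a uniformly random sample, with independent Gaussian noise on every sample; the trace length, noise level and all names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def simulate_trace(leak, length, noise_sigma, rng):
    """One noisy trace whose leakage lands at a random time location."""
    trace = [rng.gauss(0, noise_sigma) for _ in range(length)]
    trace[rng.randrange(length)] += leak
    return trace


def compare(n_traces, length, noise_sigma, seed):
    """Correlation with the leakage: fixed sample point vs. trace average."""
    rng = random.Random(seed)
    leaks = [bin(rng.randrange(256)).count("1") for _ in range(n_traces)]
    traces = [simulate_trace(h, length, noise_sigma, rng) for h in leaks]
    fixed_point = [t[length // 2] for t in traces]   # aligned-trace assumption
    averaged = [sum(t) / length for t in traces]     # average (total) power
    return pearson(leaks, fixed_point), pearson(leaks, averaged)


if __name__ == "__main__":
    r_fixed, r_avg = compare(n_traces=2000, length=50, noise_sigma=0.5, seed=0)
    print(f"fixed point: {r_fixed:.3f}, averaged: {r_avg:.3f}")
```

The average collects the leakage wherever it occurs, at the cost of also accumulating the noise of every sample, which is why the noise level of the averaged model is higher than that of an aligned single-point attack.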
Our previous modeling work [6] showed that the success rate of CPA can be predicted from two factors: i) the physical side-channel signal-to-noise ratio and ii) the algorithmic confusion coefficients (a metric defined to capture the key distinguishability due to the algorithm and the select function). The noise level tends to be higher in a CPA attack on a GPU, since the summation over time includes much more noise due to irrelevant operations in multiple threads. The confusion coefficients also differ on the GPU, as we are targeting the sum of three Hamming distances. We next derive the confusion coefficients of AES on a GPU.
5 Key Discovery of GPU by Power Analysis Attacks
We next discuss our power analysis of the targeted GPU-based AES implementation. We first extract the full AES key under the chosen-plaintext attack model, where the adversary can choose the size and content of the plaintext message. In this attack, the signal strength is boosted by redundant AES encryption instances of the same plaintext message, making full use of the GPU’s hardware resources to overcome the problems of high noise and linearity in the confusion coefficients. Then, we extend our analysis to the known-ciphertext attack model, where the attacker has no control over the plaintext input, and predict the number of traces needed for specific attack success rates.
5.1 Full Key Extraction
To take advantage of the parallel computing structure of the GPU, we let each AES block encryption use L = 4 threads. We first set the message size to 8 blocks, requiring 32 concurrent threads, which we call an AES encryption instance. We consider the attack in a multithreaded environment by executing multiple concurrent AES instances. Since the power consumption of one AES encryption is very small, we first use 768 instances of AES processing the same message to increase the power consumption for the measurement. That is, we use 768 × 4 × 8 = 24,576 threads for each power trace. Across different power traces, the plaintext messages vary and are generated independently of each other.
As shown in Fig. 9, after 40,000 traces, the correct subkey clearly stands out, producing the largest (negative) correlation coefficient. The negative correlation coefficient is due to the usage of the voltage to represent the power consumption here. Lower voltage in fact means higher power consumption. In Section 5.2, we build a statistical model to estimate the success rate of getting the correct subkey for a specific number of traces.
We also observe that, although the correlation coefficient of the correct subkey stands out, some false subkey values result in very close correlation coefficients, no matter how many traces we collect and analyze. This is because the power model (Eq. 7) depends heavily (and linearly) on the key value, as discussed at the end of Section 4.2. Therefore, false subkey values with small Hamming distances from the true value are not as easily distinguishable in a GPU setting as they are on a CPU or FPGA computing platform.
The select function for our attack (shown in Eq. 8) includes all three key-dependent Hamming distances in the last round. In general, a nonlinear select function results in higher and more concentrated confusion coefficients, which leads to a higher success rate under the same SNR (according to the success rate prediction model given in [6]). Adding up the three Hamming distances reduces the noise and increases the SNR, and also results in different confusion coefficients compared to using only one Hamming distance. Of the three Hamming distances h_{1}, h_{2} and h_{3}, h_{1} is nonlinear, while h_{2} and h_{3} are mostly linear in the subkey byte’s value. We generate seven different select functions based on different combinations of the three Hamming distances, calculate their corresponding confusion coefficients, derive SNRs from the measured traces, and plug them into the success rate prediction formula (detailed in the next section).
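The count of seven select functions matches the number of non-empty subsets of three Hamming distances (2^3 − 1 = 7). The Python sketch below enumerates them under the assumption that each select function is the unweighted sum of the chosen distances; the labels and the helper `evaluate` are hypothetical, and the numeric values passed to it are placeholders rather than results from this paper.

```python
from itertools import combinations


def select_functions(names=("h1", "h2", "h3")):
    """All non-empty combinations of the given Hamming distance labels."""
    return [combo
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


def evaluate(combo, distances):
    """Assumed select function: unweighted sum of the chosen distances."""
    return sum(distances[name] for name in combo)


if __name__ == "__main__":
    example = {"h1": 3, "h2": 4, "h3": 1}   # placeholder Hamming distances
    for combo in select_functions():
        print("+".join(combo), "=", evaluate(combo, example))
```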
5.2 A More Realistic Execution Environment
GPUs are powerful platforms, able to run thousands of concurrent threads. GPUs can be used effectively to accelerate AES encryption/decryption. In an actual AES implementation on a GPU, a large number of threads would be encrypting/decrypting different plaintext/ciphertext values concurrently. A highly tuned implementation of AES would try to utilize the full capacity of the GPU.
In this more realistic scenario, the attacker does not have control over the plaintext. We would like to understand how successful the attack will be as a function of the number of power traces collected. To extend our attack model to a more realistic execution environment, we build on our previous work presented above [19], and leverage the success rate model for CPA proposed by Fei et al. [6] to predict the number of traces needed to launch a successful attack in this situation. Mangard [22] also proposed a model to estimate the number of traces, but it does not consider the effects of false key candidates (confusion coefficients), which have a major impact on our attack.
Taking B = 8 in Fig. 12 as an example, for a realistic attack with no repeated AES encryption instances, obtaining the same success rate would require 768 (M/(BL) = 24,576/(4 × 8)) times as many traces as the reference attack. For example, to achieve a 70% success rate, the reference attack needs approximately 33,000 traces, while a realistic attack would need about 25.3 million (768 × 33,000). Furthermore, when M > BL, the unutilized threads may be used for other computation, which will increase the noise level without contributing any side-channel signal, and thus we may need even more traces to obtain the same success rate.
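The scaling arithmetic can be checked directly. The short Python sketch below uses the quantities given in the text (M = 24,576 threads in the reference attack, B = 8 blocks, L = 4 threads per block encryption, and roughly 33,000 reference traces for a 70% success rate); the function name `trace_scaling_factor` is hypothetical.

```python
def trace_scaling_factor(total_threads, blocks, threads_per_encryption):
    """Ratio M / (B * L): how many redundant instances the reference attack uses."""
    return total_threads // (blocks * threads_per_encryption)


M, B, L = 24_576, 8, 4
REFERENCE_TRACES = 33_000          # approximate value read from the text

factor = trace_scaling_factor(M, B, L)
realistic_traces = factor * REFERENCE_TRACES

if __name__ == "__main__":
    print(f"scaling factor: {factor}")
    print(f"traces for the realistic attack: {realistic_traces:,}")
```

The factor equals the number of redundant AES instances in the reference attack, since 768 instances × 8 blocks × 4 threads account for all 24,576 threads.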
6 Countermeasures
To defend AES on a GPU against side-channel power analysis, countermeasures should be applied. The common countermeasure of masking would also work on a GPU. Since the intermediate values are randomized by the mask, the attacker can no longer correlate the power with the model. However, another typical countermeasure, random delay of instructions, is not effective on a GPU. One reason is that the instructions issued by different warps on a GPU are already randomized to some extent by the scheduler. Another reason is that the attacker can always average the entire trace to capture the leakage, as is done in this paper.
With sufficient key storage and adequate key management, users should avoid using the same secret key for all encryptions on the GPU. It is not necessary to use a different key for every data block; even a couple of keys for the GPU would significantly increase the attack complexity and render the attack infeasible. Another effective countermeasure is to initialize the registers with random values before writing to them. This introduces some performance degradation, but we can limit the random initialization to the sensitive registers, e.g., the last round registers, to minimize the effect. Thanks to the high performance of the GPU, such a small overhead would be negligible.
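The masking idea can be illustrated with a first-order Boolean masked S-box lookup. The Python sketch below is a conceptual illustration only, not a vetted countermeasure and not an implementation evaluated in this paper: it assumes a recomputed table indexed by the masked input, and the names `masked_sbox_table`, `m_in` and `m_out` are hypothetical.

```python
import secrets


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Derive the AES S-box: multiplicative inverse, then the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        y = inv
        for shift in range(1, 5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y ^ 0x63)
    return sbox


SBOX = build_sbox()


def masked_sbox_table(m_in, m_out):
    """Table T with T[x ^ m_in] = SBOX[x] ^ m_out for every byte x."""
    table = [0] * 256
    for x in range(256):
        table[x ^ m_in] = SBOX[x] ^ m_out
    return table


if __name__ == "__main__":
    # Fresh random masks per encryption (assumption of this sketch).
    m_in, m_out = secrets.randbelow(256), secrets.randbelow(256)
    table = masked_sbox_table(m_in, m_out)
    x = 0x53
    masked_out = table[x ^ m_in]          # the device only handles masked values
    print(hex(masked_out ^ m_out))        # unmasking yields SBOX[x]
```

Because both the table index and the table output are XORed with fresh random masks, the Hamming distances of the handled values no longer follow the unmasked power model.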
7 Conclusion
In this paper, we present side-channel power analysis of a GPU AES implementation. We describe a process to obtain power consumption measurements on an NVIDIA GPU. The various challenges of power analysis on a GPU are highlighted. To overcome these difficulties, we have proposed effective strategies to process the power traces for a successful correlation power analysis. The corresponding power model is built based on the CUDA PTX assembly code. We begin our analysis of the attack assuming control over the plaintext, and analyze its scalability as we increase the size of the plaintext. We find a linear relationship between the amount of plaintext and the number of traces needed, though the computation complexity grows exponentially. The attack results show that a GPU, a highly popular but very complex parallel computing device, is vulnerable to side-channel power analysis attacks.
References
1. Brier E, Clavier C, Olivier F (2004) Correlation power analysis with a leakage model. In: Cryptographic hardware & embedded systems, vol 3156, pp 16–29
2. Clavier C, Coron JS, Dabbous N (2000) Differential power analysis in the presence of hardware countermeasures. Springer, Berlin, pp 252–263
3. Cook D, Keromytis AD (2006) Cryptographics: exploiting graphics cards for security, vol 20. Springer Science & Business Media
4. Cook DL, Ioannidis J, Keromytis AD, Luck J (2005) Cryptographics: secret key cryptography using graphics cards. In: Topics in cryptology–CT-RSA 2005. Springer, pp 334–350
5. Daemen J, Rijmen V (1998) AES proposal: Rijndael
6. Fei Y, Ding AA, Lao J, Zhang L (2015) A statistics-based success rate model for DPA and CPA. J Cryptogr Eng 5(4):227–243
7. Gaster B, Howes L, Kaeli DR, Mistry P, Schaa D (2013) Heterogeneous computing with OpenCL: revised OpenCL 1.2 edition, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco
8. Genkin D, Shamir A, Tromer E (2014) RSA key extraction via low-bandwidth acoustic cryptanalysis. In: Advances in cryptology–CRYPTO 2014. Springer, pp 444–461
9. Gierlichs B, Batina L, Tuyls P, Preneel B (2008) Mutual information analysis. In: Cryptographic hardware & embedded systems, pp 426–442
10. Gilger J, Barnickel J, Meyer U (2012) GPU-acceleration of block ciphers in the OpenSSL cryptographic library. In: Information security. Springer, pp 338–353
11. Hwu WM (2011) GPU computing gems emerald edition, 1st edn. Morgan Kaufmann Publishers Inc., San Francisco
12. Iwai K, Kurokawa T, Nisikawa N (2010) AES encryption implementation on CUDA GPU and its analysis. In: 2010 First international conference on networking and computing, pp 209–214. https://doi.org/10.1109/ICNC.2010.49
13. Rabaey JM, Chandrakasan A, Nikolic B (2003) Digital integrated circuits: a design perspective
14. Jiang ZH, Fei Y, Kaeli D (2016) A complete key recovery timing attack on a GPU. In: 2016 IEEE International symposium on high performance computer architecture (HPCA), pp 394–405. https://doi.org/10.1109/HPCA.2016.7446081
15. Kocher P, Jaffe J, Jun B, Rohatgi P (2011) Introduction to differential power analysis. J Cryptogr Eng 1(1):5–27
16. Leischner N, Osipov V, Sanders P (2009) Nvidia Fermi architecture white paper. http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
17. Li Q, Zhong C, Zhao K, Mei X, Chu X (2012) Implementation and analysis of AES encryption on GPU. In: 2012 IEEE 14th International conference on high performance computing and communication, 2012 IEEE 9th international conference on embedded software and systems, pp 843–848. https://doi.org/10.1109/HPCC.2012.119
18. Lombardi F, Di Pietro R (2014) Towards a GPU cloud: benefits and security issues. In: Continued rise of the cloud. Springer, pp 3–22
19. Luo C, Fei Y, Luo P, Mukherjee S, Kaeli D (2015) Side-channel power analysis of a GPU AES implementation. In: IEEE Int. Conf. on computer design (ICCD). IEEE, pp 281–288
20. Luo P, Fei Y, Fang X, Ding AA, Leeser M, Kaeli DR (2014) Power analysis attack on hardware implementation of MAC-Keccak on FPGAs. In: Int. Conf. on ReConFigurable computing and FPGAs (ReConFig), pp 1–7
21. Manavski S (2007) CUDA compatible GPU as an efficient hardware accelerator for AES cryptography. In: IEEE Int. Conf. on signal processing & communications, pp 65–68
22. Mangard S (2004) Hardware countermeasures against DPA – a statistical analysis of their effectiveness. Springer, Berlin, pp 222–235
23. Margara P (2015) Engine-CUDA, a cryptographic engine for CUDA supported devices. https://code.google.com/p/enginecuda/
24. Maurice C, Neumann C, Heen O, Francillon A (2014) Confidentiality issues on a GPU in a virtualized environment. In: Financial cryptography and data security. Springer, pp 119–135
25. Messerges TS, Dabbish EA, Sloan RH (1999) Power analysis attacks of modular exponentiation in smartcards. In: Cryptographic hardware & embedded systems, pp 144–157
26. Moradi A, Hinterwälder G (2015) Side-channel security analysis of ultra-low-power FRAM-based MCUs. In: Proc. Int. Workshop on constructive side-channel analysis & secure design
27. NVIDIA (2015) CUDA C Programming Guide. http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
28. Örs SB, Gurkaynak F, Oswald E, Preneel B (2004) Power-analysis attack on an ASIC AES implementation. In: Int. Conf. on info. tech.: coding & computing, vol 2, pp 546–552
29. Örs SB, Oswald E, Preneel B (2003) Power-analysis attacks on an FPGA–first experimental results. In: Cryptographic hardware & embedded systems, pp 35–50
30. Pietro RD, Lombardi F, Villani A (2016) CUDA leaks: a detailed hack for CUDA and a (partial) fix. ACM Trans Embedded Comput Syst (TECS) 15(1):15
31. Szerwinski R, Güneysu T (2008) Exploiting the power of GPUs for asymmetric cryptography. In: Cryptographic hardware and embedded systems. Springer, pp 79–99