
1 Introduction

The need for secure and efficient implementations of cryptography for embedded systems has been an active area of research since at least the birth of public-key cryptography. While considerable progress has been made in recent years, with the development of many cryptographic engineering techniques for optimizing and protecting implementations of both symmetric [24] and asymmetric algorithms [9], the emergence of the Internet of Things (IoT) brings new challenges. The concept assumes an extraordinary number of devices connected to the Internet and among themselves in local networks. Devices range from simple radio-frequency identification (RFID) tags to complex gadgets such as smartwatches, home appliances and smartphones, and fulfill a wide variety of roles, from the automation of simple processes to critical tasks such as traffic control and environmental surveillance [5].

In a certain sense, the IoT is already here, as the number of devices storing and exchanging sensitive data rapidly multiplies. The scale at which security issues arise in this scenario poses challenges in terms of software security, interoperable authentication mechanisms, and cryptographic algorithms and protocols. The possible proliferation of weak proprietary standards is particularly worrying, aggravated by the fact that IoT devices are often physically exposed or widely accessible via the network, which opens up new possibilities of attacks making use of side-channel leakage. These leaks occur through operational aspects of a concrete realization of a cryptographic algorithm, such as the execution time of a program [14, 25]. Consequently, securely implementing cryptographic algorithms on typical IoT devices remains a relevant research problem for the next few years, further complicated by the limited availability of resources such as RAM and computational power on these devices.

In order to fulfill the need for cryptographic implementations tailored for resource-constrained embedded devices, many lightweight algorithms have been proposed for various primitives. One such proposal is the PRESENT block cipher [11], a substitution-permutation network designed by Bogdanov et al. and published at CHES 2007, which has received a great deal of attention from the cryptologic community and was standardized by ISO for lightweight cryptographic methods [37]. The block cipher has two versions, PRESENT-80 with an 80-bit key and PRESENT-128 with a 128-bit key, which differ only in the key schedule; one of its main design goals was to optimize the hardware implementation. In this work, we focus on this block cipher, providing an alternative formulation of the original PRESENT algorithm. We discuss why our formulation is expected to be more efficient in software and provide implementation results that support this claim. We also analyze the impact of using a second-order masking scheme as a side-channel leakage countermeasure.

Our Contributions. We introduce a new portable and secure software implementation of PRESENT that leads to significant performance improvement compared to previous work. The main idea consists in optimizing the computation of permutation P in two consecutive rounds, by replacing it with two more efficient permutations \(P_0\) and \(P_1\) in alternated rounds. In this work, side-channel resistance is implemented through constant time execution and masking countermeasures. Our implementations are evaluated on embedded ARM processors, but the techniques should remain efficient across platforms. Extensive experimental results provided on both Cortex-M microcontrollers and more powerful Cortex-A processors indicate that we obtained the fastest side-channel resistant implementation of PRESENT for our target architectures.

Organization. Section 2 reviews related work on software implementations of PRESENT and Sect. 3 describes the original specification of the block cipher. Novel techniques for efficient software implementation are discussed in Sect. 4, and security properties and side-channel countermeasures in Sect. 5. Section 6 describes our target platforms and relevant aspects of our implementation, and presents the performance figures we obtained before comparing them with results from the open research literature. Conclusions are drawn in Sect. 7.

2 Related Work

The design of PRESENT [11] has motivated an extensive amount of research in the cryptologic community, both in terms of cryptanalysis and engineering aspects. The main results in these regards are summarized here.

Starting from the cryptanalytic results, many techniques have been explored to break PRESENT’s security claims [10, 15, 27, 38], and, yet, the best full-round attack found is a biclique attack [27] able to recover the secret key using \(2^{79.76}\) encryptions of PRESENT-80 or \(2^{127.91}\) encryptions of PRESENT-128. Although the result is technically a proof that PRESENT is not an ideally secure block cipher, it actually helps build confidence in the cipher design: after extensive research efforts, the best known attack still requires almost as much computational effort as a brute-force attack.

Regarding the efficient implementation of PRESENT, one of the most comprehensive works is the PhD thesis by Axel Poschmann, one of PRESENT’s designers [33]. The author discusses a plethora of implementation results, both in hardware and in software, for a wide selection of architectures, ranging from 4-bit to 64-bit devices. For the software implementations, the author presents different versions optimized for either code size or speed. He focuses on implementing the S-box as a lookup table, which is potentially vulnerable to timing attacks in processors equipped with cache memory. Hence, the optimizations introduced to improve the S-box performance cannot be used in our work, because we are concerned with side-channel security.

In [31], Papapagiannopoulos et al. present efficient bitsliced implementations of PRESENT, along with implementations for other block ciphers, having as target architecture the ATtiny family of AVR microcontrollers. This work employs an extension [17] of Boyar-Peralta heuristics [13] to minimize the complexity of digital circuits applied to PRESENT, providing a set of 14 instructions to compute the S-box. Bao et al. [6] adapt the approach to implement the inverse S-box in 15 instructions for the LED cipher, which shares the same substitution layer with PRESENT.

Similarly to [31], Benadjila et al. [7] also provide bitsliced implementations for many different block ciphers, including PRESENT, but this time for Intel x86 architectures. One of the primary focuses of this work is the usage of SIMD instructions to speed up the implementations through vectorization.

It is also important to cite the work of Dinu et al. [18], which implements and optimizes PRESENT alongside twelve other block ciphers for three different platforms: the 8-bit ATmega, the 16-bit MSP430 and the 32-bit ARM Cortex-M3. Their best results for PRESENT were obtained through a table-based implementation that, in some instances, merges the permutation layer and the substitution layer of the cipher. Since the Cortex-M3 is also one of the target architectures of our work, it is relevant to observe actual figures in this case. For this platform, the authors report an execution time of 16,919 clock cycles for encrypting 128 bits of data in CTR mode and 270,603 cycles for running the key schedule, encrypting and decrypting 128 bytes of data in CBC mode.

Out of all the aforementioned works, none of them discusses side-channel security and many even explicitly state the usage of large tables to compute the PRESENT S-box, which is a well-known source of side-channel leakage [12]. However, there are some researchers who address this issue. For example, [22] presents a bitsliced implementation for PRESENT that uses a masking scheme to provide second-order protection against side-channel attacks. The authors use a device endowed with a Cortex-M4 processor and report an execution time of 6,532 cycles to encrypt one 64-bit block, excluding the time consumed by the random number generator in the masking routine. They also provide experimental evidence for the effectiveness of masking as a side-channel attack countermeasure in ARM-based architectures. It is worth noting, however, that the masking scheme used by the authors only aims to protect the S-box computation, hence leaving the key unmasked and the algorithm open to possible attacks that might target specific sections of the code.

Finally, we mention the paper [32], which applies a technique called Threshold Implementation to counteract differential power analysis attacks and glitches in hardware circuitry. This alternative masking scheme, originally proposed by Nikova et al. [29], has the advantage of not requiring the generation of random bits for computing operations between shares of secret information, but demands the evaluation of multiple S-boxes, which can become computationally expensive in software.

3 The PRESENT Block Cipher

The PRESENT block cipher [11] is a substitution-permutation network (SPN) that encrypts a 64-bit block using a key of 80 or 128 bits. The key is first processed by the key schedule to generate 32 round keys \(subkey_1, ..., subkey_{32}\) of 64 bits each. To encrypt a given block of data, the cipher repeats the following steps for 31 rounds: the block is XORed with the corresponding round key; each contiguous set of 4 bits in the block is substituted according to the output of the substitution box (S-box) S; and then the 64 bits are rearranged by a permutation P. After the final round, the block is XORed with \(subkey_{32}\). A high-level description of PRESENT encryption is given in Algorithm 1.


The S-box S acts on every 4 bits of the block, as specified in Table 1. Although the most straightforward way to implement the S-box in software is via a lookup table, [31] shows how to simulate one evaluation of this function by performing 14 Boolean operations on the 4 input bits. Listing 1.1 contains a C-language implementation of the S-box and also of its inverse, which is useful for the decryption algorithm. The S-box was directly obtained from [31] using the extended Boyar-Peralta heuristics [13]. We computed the inverse S-box using the same approach with software from Brian Gladman [21]. Our inverse S-box has 15 instructions, matching the count obtained by Bao et al. [6], where the function was not given explicitly.
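To illustrate the bitsliced style of S-box evaluation discussed above, the sketch below evaluates the PRESENT S-box directly from its algebraic normal form. This is not the optimized 14-instruction sequence of [31] (which computes the same function with fewer operations); the function name is ours, chosen for illustration.

```c
#include <stdint.h>

/* Bitsliced PRESENT S-box from its algebraic normal form (ANF).
 * Word x[k] holds bit k of sixteen nibbles (x[0] = least significant
 * bits), so one call evaluates sixteen S-boxes simultaneously. */
static void present_sbox_bs(const uint16_t x[4], uint16_t y[4])
{
    uint16_t x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];
    y[0] = x0 ^ x2 ^ (x1 & x2) ^ x3;
    y[1] = x1 ^ (x0 & x1 & x2) ^ x3 ^ (x1 & x3)
              ^ (x0 & x1 & x3) ^ (x2 & x3) ^ (x0 & x2 & x3);
    y[2] = (uint16_t)~((x0 & x1) ^ x2 ^ x3 ^ (x0 & x3) ^ (x1 & x3)
              ^ (x0 & x1 & x3) ^ (x0 & x2 & x3));
    y[3] = (uint16_t)~(x0 ^ x1 ^ (x1 & x2) ^ (x0 & x1 & x2) ^ x3
              ^ (x0 & x1 & x3) ^ (x0 & x2 & x3));
}
```

Each logical operation here acts on all sixteen S-box lanes at once, which is exactly what makes the bitsliced representation attractive on word-oriented processors.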

Table 1. PRESENT S-box, given in hexadecimal notation.

The permutation P is specified by Eq. 1 below and moves the i-th bit of the state to the position P(i):

$$\begin{aligned} P(i)={\left\{ \begin{array}{ll} 16i \bmod 63, & \text {if } i \ne 63,\\ 63, & \text {if } i = 63. \end{array}\right. } \end{aligned}$$
(1)

From the definition of P, one can easily verify that \(P^2 = P^{-1}\). By looking at Fig. 1, another interesting property of this permutation can be noticed: if the 64-bit state of the cipher is stored in four 16-bit registers, the application of the permutation P aligns the state in such a way that the concatenation of the i-th bit of each of the four registers of the permuted state corresponds to 4 consecutive bits of the original state. These properties will be exploited by the technique proposed later.
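The property \(P^2 = P^{-1}\) (equivalently, \(P^3\) is the identity, since \(16^3 \equiv 1 \pmod{63}\)) can be checked mechanically. The sketch below, with helper names of our own choosing, builds P from Eq. 1 as a table and applies it to a 64-bit state, moving bit i of the input to position P(i).

```c
#include <stdint.h>

/* Build the PRESENT bit permutation P of Eq. 1 as a 64-entry table. */
static void build_P(uint8_t P[64])
{
    for (int i = 0; i < 63; i++)
        P[i] = (uint8_t)((16 * i) % 63);
    P[63] = 63;
}

/* Apply a bit permutation: bit i of b moves to position P[i]. */
static uint64_t apply_perm(const uint8_t P[64], uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        r |= ((b >> i) & 1ULL) << P[i];
    return r;
}
```

A table-driven loop like this is convenient for testing; the constant-time implementations discussed later compute the permutations with fixed sequences of shifts and masks instead.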

Fig. 1. Matrix representation of the 64-bit input block B and its permutation P(B), both split into four 16-bit rows.

4 Efficient Implementation

The main novelty introduced in this work lies in the techniques devised to efficiently implement the PRESENT block cipher in software, which are now described. First, we limit the scope to PRESENT-80, the version using an 80-bit key, which is better suited for lightweight applications due to its smaller memory footprint. The encryption and decryption routines are exactly the same for the 128-bit version; the only difference is in the key schedule, which should not be a performance-critical section of the algorithm. In fact, applying the same techniques described here to PRESENT-128 yields, within a 5% margin, the same time measurements for all scenarios we consider.

Algorithm 2 specifies our proposal for implementing encryption of a single block with PRESENT. Essentially, every two applications of permutation P are replaced by evaluations of permutations \(P_0\) and \(P_1\), which satisfy the property that \(P_1 \circ P_0 = P^2\), a fact that preserves the correctness of the modified algorithm. The way \(P_0\) and \(P_1\) act upon the cipher state is represented in Fig. 2, and code in the C programming language implementing both permutations follows in Listing 1.2. In the description of this algorithm, we use the function \(S_{BS}\), which we define as the same S-box used for PRESENT, but taking as inputs state bits whose indexes are congruent modulo 16 instead of every four consecutive bits. In other words, this S-box interprets the state of the cipher as four 16-bit words and operates on them in a bitsliced fashion.


We need to observe two facts to prove the equivalence between Algorithms 1 and 2. First, the S-box S in Algorithm 1 acts on the same quadruplets of bits that \(S_{BS}\) acts on in Algorithm 2, since both \(P_0\) and P bitslice the state over 16-bit words. Then note that \(P(P(X \oplus subkey_{i}) \oplus subkey_{i+1}) = P^2(X \oplus subkey_{i}) \oplus P(subkey_{i+1}) = P_1(P_0(X \oplus subkey_{i})) \oplus P(subkey_{i+1})\), the leftmost term being exactly the transformation undergone by state X over rounds i and \(i+1\) in Algorithm 1, and the rightmost term the transformation undergone by state X over rounds i and \(i+1\), for i odd, in Algorithm 2, when we disregard the S-box step in both algorithms. Since the S-boxes operate equivalently and, without the S-boxes, the algorithms are also equivalent, the proof is concluded.
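The first equality in the chain above uses the fact that any bit permutation is linear over GF(2), i.e., it distributes over XOR: \(P(X \oplus K) = P(X) \oplus P(K)\). This can be sanity-checked numerically; the helper below (its name is ours) applies P from Eq. 1 directly.

```c
#include <stdint.h>

/* Apply the PRESENT permutation P of Eq. 1 to a 64-bit state:
 * bit i moves to position 16*i mod 63 (bit 63 is fixed). */
static uint64_t present_P(uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        int p = (i == 63) ? 63 : (16 * i) % 63;
        r |= ((b >> i) & 1ULL) << p;
    }
    return r;
}
```

Since a permutation only relocates bits, the XOR of two states permutes to the XOR of the permuted states, which is what allows the round key of the next round to be pulled out of the permutation in the proof.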

Now, at first glance, it may not be clear why our alternative version of PRESENT is faster than the original one, but there are two main advantages. The first one is lower complexity in software: permutations \(P_0\) and \(P_1\) are simply more software-friendly, requiring fewer operations than the permutation P. Evidence corroborating this fact was obtained from the source code generator for bit permutations provided by Jasper Neumann [28], which estimates a cost of 14 clock cycles to execute either \(P_0\) or \(P_1\) and 24 cycles to execute P, when implemented optimally.

Fig. 2. Matrix representation of the 64-bit input block B and its permutations \(P_0(B)\) and \(P_1(B)\), all of them divided into four 16-bit rows.

Fig. 3. Diagram showing equivalent ways to implement two consecutive rounds (i and \(i+1\)) of PRESENT encryption. The original specification of the block cipher corresponds to the leftmost diagram, and the rightmost one corresponds to the version proposed here, with alternating \(P_0\) and \(P_1\) permutations.

The second advantage of our proposal involves the application of the S-box. A careful analysis of Algorithm 2 leads to the conclusion that, at lines 6, 9 and 13, where the S-box is applied, the state of the variable C is not the same as the state to which the S-box is applied in Algorithm 1. At line 6, the state has undergone an extra \(P_0\) permutation in relation to the original formulation; at lines 9 and 13, the state has undergone an extra evaluation of P. By looking at Figs. 1 and 2, it becomes clear that, if the ciphertext is stored in four 16-bit registers, both P and \(P_0\) organize the state in such a way that every four consecutive bits are aligned in columns throughout those four registers, similarly to what would be seen in a fully bitsliced implementation. Therefore, an implementation following the structure of Algorithm 2 can make use of bitwise operations to simulate the S-box step, calculating sixteen S-box applications simultaneously.

The same rationale may be applied to generate other alternative versions of the PRESENT encryption algorithm. Figure 3 illustrates different versions of PRESENT obtained by interchanging S-box applications and permutations. In this figure, S represents the S-box applied over every four consecutive bits of the state and \(S_{BS}\) represents the S-box computed in a bitsliced fashion.

One last observation to further improve performance on a 32-bit architecture is that two blocks of plaintext can be encrypted at once in CTR mode, organizing the state such that 32 S-boxes are calculated simultaneously instead of only 16. On a 64-bit architecture, the same strategy can be carried out to encrypt four blocks at once.
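One way to realize this on a 32-bit target is to interleave the four 16-bit rows of two blocks into four 32-bit words, so that 32 S-box columns line up side by side. The layout and function names below are an illustrative assumption of ours, not the paper's actual code.

```c
#include <stdint.h>

/* Pack two 64-bit blocks for double-block processing: the k-th 16-bit
 * row of block b0 goes into the low half of word s[k], the same row of
 * b1 into the high half. */
static void pack2(uint64_t b0, uint64_t b1, uint32_t s[4])
{
    for (int k = 0; k < 4; k++) {
        uint32_t r0 = (uint32_t)((b0 >> (16 * k)) & 0xFFFF);
        uint32_t r1 = (uint32_t)((b1 >> (16 * k)) & 0xFFFF);
        s[k] = r0 | (r1 << 16);
    }
}

/* Inverse of pack2: recover the two 64-bit blocks. */
static void unpack2(const uint32_t s[4], uint64_t *b0, uint64_t *b1)
{
    *b0 = 0;
    *b1 = 0;
    for (int k = 0; k < 4; k++) {
        *b0 |= (uint64_t)(s[k] & 0xFFFF) << (16 * k);
        *b1 |= (uint64_t)(s[k] >> 16) << (16 * k);
    }
}
```

With this layout, the same bitwise S-box formulas operate on 32-bit words and evaluate 32 S-boxes per pass; on a 64-bit machine the analogous packing holds four blocks.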

All of the algorithmic observations and implementation techniques discussed here extend directly to the decryption routine, as shown by Algorithm 3. The inversion of encryption is particularly simplified by the fact that \(P_0\) and \(P_1\) are involutory permutations, that is, \(P_0^{-1} = P_0\) and \(P_1^{-1} = P_1\). This involutory property has yet another advantage: since \(P_1 \circ P_0 = P^2\) and \(P^2 = P^{-1}\), it follows that \(P = P_0 \circ P_1\), which can be used to reduce the code size of the implementation, because the permutation P does not need to be implemented separately once \(P_0\) and \(P_1\) have been coded.


Finally, it is important to notice that our proposal has the drawback of applying the permutation P to some of the round keys. However, since typically many blocks of message are encrypted or decrypted with the same key, the key schedule routine should have a low impact on the algorithm’s practical performance, as it is executed only once for several executions of the encryption/decryption routines.

5 Side-Channel Countermeasures

As commented previously, there has been extensive work on the cryptanalysis of PRESENT and the lack of significant advances provides evidence that the cipher is likely to fulfill the desired security goals. However, even if the block cipher design is ideally secure, a careless implementation may leak sensitive data during execution and undermine the security of the algorithm with its insecure realization.

Particularly, a major concern is side-channel attacks, that is, attacks which are crafted based on information obtained from the physical implementation of a cryptographic primitive. For instance, an attacker may gather data such as execution time of an algorithm [14, 19, 25], power consumption [30], sound produced by the hardware [20] or even magnetic radiation emitted during the computation [26] and, through these data, the attacker may gain access to sensitive information processed by the device under analysis.

It is worth noting that side-channel attacks are limited to situations where the attacker has physical access to the hardware executing the implementation or at least can interact with the device through the network. It is not completely unreasonable to ignore the possibility of such attacks when the implementation of the algorithm is physically protected from the attacker or not accessible for any kind of interaction, but reality tends to go in the opposite direction in the IoT context. In this scenario, devices are frequently accessible to the attacker by either physical means or through the network and typically lack tamper-resistance countermeasures for protecting the hardware from external influence.

5.1 Protecting Against Timing Attacks

The focus is primarily on timing attacks, since they are entirely within the scope of software implementation, and appear to be the most practical side-channel attack. Furthermore, protecting software implementations from more invasive side-channel attacks is very challenging, since the software countermeasures can be typically circumvented by an invasive attacker. Recent work has developed static analysis tools to detect variances in execution time correlated with secret information at a rather low level [2, 34], allowing implementers to formally guarantee constant execution time of their code or at least implement mitigations.

In practice, the main sources of timing vulnerabilities are memory accesses and conditional branches depending on secret data. Conditional branching, by definition, may cause different instructions to be executed among different runs of a program, which, in turn, may cause the execution time of the algorithm to depend on sensitive data given as input. The effect of branch misprediction in more sophisticated processors may further interfere with pipelined datapaths and provoke significant variations [1]. In a similar way, if a processor is equipped with cache memory, the execution time may leak information about the rate of cache misses or hits during memory accesses, and, clearly, if these accesses depend on sensitive data, the implementation becomes susceptible to side-channel attacks [8]. Therefore, by avoiding these situations, a software implementation can encrypt a message block in constant time, independently of the characteristics of the inputs (plaintext message or cryptographic key). This runtime property is called isochronicity.
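A standard building block for avoiding secret-dependent branches is branch-free selection with bitmasks. The sketch below is a generic illustration of the idea (the function name is ours, not taken from the paper's code).

```c
#include <stdint.h>

/* Constant-time selection: returns a if bit == 1, b if bit == 0,
 * with no data-dependent branch or memory access. */
static uint32_t ct_select(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - bit;  /* all-ones if bit==1, zero if bit==0 */
    return (a & mask) | (b & ~mask);
}
```

The same sequence of instructions executes regardless of the secret selector bit, so the running time carries no information about it; an `if`/`else` on the same condition would not have that property.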

5.2 Masking the Implementation

Ensuring that code runs in constant time is sufficient to render timing attacks impractical, although other side-channel leakages might still be exploited. Another family of techniques for improving side-channel resistance is called secret sharing, or masking, which consists in splitting sensitive variables occurring in the computation into \(d+1\) shares (or masks) in order to break the correlation between environmental information and the secret data being processed. A masking technique based on \(d+1\) masks is said to be a d-th order masking and can only be broken by an attacker who manages to obtain leakage related to at least \(d+1\) intermediate variables of the algorithm. It is possible to prove that the difficulty for a side-channel attack to succeed in practice increases exponentially with d and, hence, the masking order can be considered a sound criterion to evaluate the robustness of an implementation against side-channel analysis [16].

The literature presents different alternatives to implement a masked encryption algorithm [32], but analysis will be restricted to the proposal given by Ishai et al. in [23], which appears to be the most appropriate for a fast software implementation. In this proposal, the masked state of a sensitive variable m with \(d+1\) shares is

$$\begin{aligned} m = \bigoplus \limits _{i=0}^{d}m_i = m_0 \oplus m_1 \oplus \ldots \oplus m_d, \end{aligned}$$
(2)

where each \(m_i\) is a share of the secret and all shares together form the masked secret. In order to create a masked representation of the variable m, one can randomly generate the d masks \(m_1, m_2, ..., m_d\) and calculate \(m_0\) such that Eq. 2 holds.
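For second-order masking (d = 2, three shares) over 16-bit words, this procedure can be sketched as below. The function names are ours, and `rand()` is only a stand-in: a deployed implementation must draw the masks from a cryptographically secure random number generator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Split a 16-bit secret into three shares per Eq. 2: two shares are
 * random and the third is chosen so their XOR equals the secret. */
static void mask2(uint16_t m, uint16_t shares[3])
{
    shares[1] = (uint16_t)rand();   /* placeholder RNG: use a CSPRNG */
    shares[2] = (uint16_t)rand();
    shares[0] = (uint16_t)(m ^ shares[1] ^ shares[2]);
}

/* Recombine the shares to recover the secret. */
static uint16_t unmask2(const uint16_t shares[3])
{
    return (uint16_t)(shares[0] ^ shares[1] ^ shares[2]);
}
```

No single share (nor any pair of shares) reveals anything about m when the masks are uniformly random, which is what gives the scheme its second-order resistance.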

From this definition, we can derive ways to calculate different operations over the masks. The following list contains all operations necessary to implement a masked version of PRESENT.

  1. A NOT operation over a masked secret has to be carried out as a NOT operation performed on an odd number of shares to preserve the relationship in Eq. 2. Negating a single share suffices:

     $$\lnot m = \lnot m_0 \oplus m_1 \oplus \ldots \oplus m_d.$$

  2. An XOR operation between masked secrets \(a = \bigoplus \limits _{i=0}^{d}a_i\) and \(b = \bigoplus \limits _{i=0}^{d}b_i\) can be performed by calculating the XOR of all corresponding shares:

     $$ a \oplus b = \bigoplus \limits _{i=0}^{d}(a_i \oplus b_i).$$

  3. An AND operation between two masked secrets is more complicated and can be computed as follows: for every pair (i, j), \(0 \le i < j \le d\), generate a random bit \(z_{i,j}\), then compute \(z_{j,i}=(z_{i,j} \oplus a_i b_j )\oplus a_j b_i\). Now, for every \(0 \le i \le d\), the i-th share of the product may be computed as

     $$ m_i = a_i b_i \oplus \bigoplus _{j \ne i} z_{i,j}.$$

  4. An OR operation can be calculated using the logical identity OR \((a,b)=\lnot (\lnot a \cdot \lnot b)\), which depends only on operations previously defined.

The nonlinear operations OR and AND stand out as the most expensive ones, requiring \(O(d^2)\) calls to a random bit generator and memory to store a matrix z of \(O(d^2)\) entries. This is the main drawback of the technique in resource-constrained devices and makes the use of high-order masking impractical in many scenarios.
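The masked AND described above can be sketched in C for d = 2 (three shares), operating bitwise on 16-bit words. The function name is ours, and `rand()` again stands in for the secure generator whose \(O(d^2)\) invocations dominate the cost.

```c
#include <stdint.h>
#include <stdlib.h>

/* Masked AND of two second-order-masked 16-bit words, following the
 * Ishai-Sahai-Wagner construction: z[i][j] holds the correction terms.
 * The output shares satisfy m0 ^ m1 ^ m2 == (a0^a1^a2) & (b0^b1^b2). */
static void masked_and2(const uint16_t a[3], const uint16_t b[3],
                        uint16_t m[3])
{
    uint16_t z[3][3] = {{0}};
    for (int i = 0; i < 3; i++)
        for (int j = i + 1; j < 3; j++) {
            z[i][j] = (uint16_t)rand();   /* placeholder RNG */
            z[j][i] = (uint16_t)((z[i][j] ^ (a[i] & b[j])) ^ (a[j] & b[i]));
        }
    for (int i = 0; i < 3; i++) {
        m[i] = (uint16_t)(a[i] & b[i]);
        for (int j = 0; j < 3; j++)
            if (j != i)
                m[i] ^= z[i][j];
    }
}
```

XORing all output shares cancels every \(z_{i,j} \oplus z_{j,i}\) pair down to the cross terms \(a_i b_j \oplus a_j b_i\), so the reconstruction equals the AND of the two secrets.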

6 Implementation Details and Results

6.1 Target Architecture

Currently, there is a vast variety of processors under consideration for integration into the IoT. The focus of this work is on some representatives of the ARM architecture, since it is the world leader in the microprocessor market and, thus, attracts relevant academic work as well as commercial interest. More specifically, our implementations were benchmarked on the following platforms:

  • Cortex-M0+: Arduino Zero powered by an Atmel SAMD21G18A ARM Cortex-M0+ CPU, clocked at 48 MHz.

  • Cortex-M3: Arduino Due powered by an Atmel SAM3X8E ARM Cortex-M3 CPU, clocked at 84 MHz.

  • Cortex-M4: Teensy 3.2 board containing a MK20DX256VLH7 Cortex-M4 CPU, clocked at 72 MHz.

  • Cortex-A7/A15: ODROID-XU4 board containing a Samsung Exynos5422 octa-core CPU with 2 GHz Cortex-A15 and Cortex-A7 clusters.

  • Cortex-A53: ODROID-C2 board containing an Amlogic 64-bit ARM 2 GHz Cortex-A53 (ARMv8) quad-core CPU.

Members of the Cortex-M [4] family are commonly used in embedded applications, being found on devices ranging from medical instrumentation equipment to domestic household appliances. The design of these processors is optimized for cost and energy efficiency, making them relatively low-end when compared to the other targets.

As for the members of Cortex-A [3] family, they are more computationally powerful than the Cortex-M processors, being able to execute complex tasks such as running a robust operating system or a high-quality multimedia task. These processors have access to the NEON engine, a powerful Single Instruction Multiple Data (SIMD) extension, and may have sophisticated out-of-order execution.

6.2 Main Results

In order to discuss our results, the code size and speed of our implementations are measured in two scenarios based on what is proposed in the FELICS framework [18], so that results can be compared in a fair and reliable manner.

Scenario 1 simulates a communication protocol established in sensor networks or between IoT devices. It is assumed here that the device possesses the master key stored in RAM, calculates the key schedule and then proceeds to encrypt and decrypt 128 bytes of sensitive data using the CBC mode of operation. Due to the employment of the CBC mode, the suggested trick of encrypting more than one block in parallel does not work, since this mode of operation forces dependencies between consecutive input blocks. Hence, this is clearly not the optimal scenario for our techniques, but we still chose to implement it exactly as described in [18] for the sake of comparison.

Scenario 2 simulates an authentication protocol in which the block cipher is used to encrypt 128 bits of data in the CTR mode of operation. The round keys are assumed to be stored in memory and, consequently, no key schedule is required. This is a very appropriate setting to employ all of the optimizations proposed so far, since the CTR mode encrypts and decrypts blocks of input independently.

Results for both scenarios are expressed in Tables 2 and 3. All the measurements were based on code fully written in the C language, compiled by GCC 6.3.1 in the case of the Cortex-A family and by GCC 4.8.4 for the Cortex-M family, using the \(\texttt {-O3}\) flag for optimized speed results. The isochronicity property of the constant-time implementations was validated using the FlowTracker static analysis tool [34]. FlowTracker performs information flow analysis from function inputs marked as secret to branch instructions and memory addresses, effectively detecting timing leaks. This tool analyzes compiled code at the LLVM Intermediate Representation level, thus closer to the platform-specific native code. All timings for Cortex-M processors were reproduced to a reasonable degree on the ARM Cortex-M Prototyping System (MPS2), an FPGA-based board with support for microcontrollers ranging from the Cortex-M0 to the M7. However, we only report timings collected on the widely available platforms to simplify comparisons with future competing implementation efforts.

Table 2. Performance results for Scenario 1 – key schedule, encryption and decryption of 128 bytes in CBC mode – of side-channel resistant implementations of PRESENT, encompassing both isochronous (constant time) and second-order masking countermeasures.
Table 3. Performance results for Scenario 2 – encryption of 128 bits in CTR mode – of side-channel resistant implementations of PRESENT, encompassing both isochronous (constant time) and second-order masking countermeasures.

One of the main observations drawn from these measurements is that the cost of protecting the implementations with masking is high, especially on lower-end processors. In our case, a second-order masking was used and the time consumed by the random number generator was disregarded. Still, a slowdown of up to 6.8 times was observed in the case of the Cortex-M0+. For higher-end processors, however, the slowdown can be below a factor of 4. Across all processors, a noticeable increase in code size due to masking is observed.

Another fact to notice is that, as expected, even when differences in input size are taken into account, the performance of PRESENT in Scenario 2 is substantially better than the performance in Scenario 1, mainly due to the choice of mode of operation. In Scenario 1, using the CBC mode, only decryption can be parallelized, and encryption ends up being roughly twice as slow as in CTR mode.

6.3 Vector Implementation Using NEON

For the platforms with access to NEON instructions, parallelism within the PRESENT encryption algorithm can also be exploited to enhance performance. In particular, it is relevant to mention that the NEON instructions VTBL and VTBX allow fast table lookups to be computed through register operations, without the need for memory accesses.

Besides the original formulation of the algorithm, which implements S-boxes as lookup tables, we were also able to evaluate the performance of a different proposal mentioned in [33] and attributed to Gregor Leander. The idea is similar to ours, in principle, since it decomposes the permutation P into two others. However, Leander’s decomposition aims to allow a faster lookup-table-based implementation, which is the opposite of the direction we pursue. Still, even using the NEON instructions to implement the lookup tables used in Leander’s method, our formulation was found to be faster.

NEON implementations can process eight blocks simultaneously thanks to the support for 128-bit registers, in the same fashion as processing two blocks in parallel on 32-bit processors or four blocks in parallel on 64-bit ones. For this reason, neither scenario used previously is appropriate to evaluate vector implementations: Scenario 1 does not support parallelism due to the mode of operation employed, and Scenario 2 processes only 128 bits of data, which amounts to only two blocks of input, not exercising the full capacity of processing eight blocks at once.

For this reason, we chose to analyze the performance of our NEON implementations under a third scenario, in which we run the key schedule, encrypt and decrypt 128 bytes of data. These results are reported in Tables 4 and 5, alongside the results of the native implementation, without vector instructions, to provide a baseline for comparison.

Table 4. Performance results for isochronous execution of the key schedule, encryption and decryption of 128 bytes of data in CTR mode, using both serial and vectorized code.
Table 5. Performance results for execution of the key schedule, encryption and decryption of 128 bytes of data in CTR mode, using both serial and vectorized code, protected by second-order masking.

The results show that the NEON instructions provide a meaningful speedup on the 32-bit processors. On the 64-bit Cortex-A53, however, the efficiency of native instructions, combined with the ability to process four blocks in parallel, beats the vector implementation by a small margin. Naturally, these implementations have a substantial impact on code size when compared to Table 2.

Note also that the only difference between this third scenario and Scenario 1 is the mode of operation. This further illustrates how much better CTR performs when the parallelism intrinsic to the encryption routine can be exploited.

6.4 Comparison with Related Work

Although many implementation results for PRESENT have been published, we focus here on comparing our metrics with the works of [18, 22], which are, to the best of our knowledge, the most efficient publicly available implementations of PRESENT on platforms similar to ours.

In [18], implementations of many block ciphers are presented and benchmarked on a Cortex-M3 processor. For a scenario identical to our Scenario 2, they report an execution time of 16,786 clock cycles and a code size of 3,568 bytes; our implementation is almost 8 times faster and over 30% smaller. They also report measurements for Scenario 1, namely 270,603 cycles of execution and 2,528 bytes of code, again slower and larger than our implementation, but by a smaller margin, since the CBC mode of operation employed in that case does not benefit from some of our optimizations.

The work of [22] showcases a bitsliced implementation of PRESENT on a Cortex-M4, protected by second-order masking, which encrypts one input block in 6,532 cycles. We argue that our results are better: even assuming no penalty from coupling with a mode of operation, that implementation would encrypt 128 bits of data in 13,064 cycles, slower than the 11,096 cycles we achieve on the same processor in Scenario 2. Furthermore, since that implementation has a bitslice factor of 32, it cannot encrypt only 128 bits of data without extra work, whereas ours is not only faster but also more flexible, in the sense that it allows small amounts of data to be encrypted efficiently.

It is also instructive to consider performance results for other block ciphers to gauge how useful our techniques may be in practice. In particular, we take a closer look at AES, arguably the most widely used block cipher today, originally praised for its good performance in software [35]. The current state-of-the-art implementations of AES on Cortex-M processors are from [36], which presents several different results. Table 6 compares our results to theirs when encrypting 128 bits of data in CTR mode in constant time. PRESENT is slower than AES on the Cortex-M3 but slightly faster on the Cortex-M4, and on both processors PRESENT’s code footprint is several times smaller.

Table 6. Comparison between our results for PRESENT and results from [36] for AES when encrypting 128 bits of data in CTR mode, in constant time.

7 Conclusion

In this work, we presented a novel technique for accelerating encryption and decryption with the PRESENT block cipher. Our modified algorithm is expected to be faster in software than the original PRESENT specification on many platforms, and our experimental data confirms that it significantly outperforms state-of-the-art results on processors of the ARM Cortex-M family. This makes PRESENT competitively efficient even against secure implementations of widely used software-oriented ciphers such as AES.

Furthermore, our proposal has the advantage of being readily implementable in constant time, which is relevant in contexts where side-channel attacks are a concern. For further side-channel security, we implemented and analyzed the performance impact of a second-order masking scheme.

Finally, we showed that our technique also applies to vector implementations, for example using the ARM NEON extension, achieving even higher performance gains on compatible platforms.