Efficient Design Strategies Based on the AES Round Function
Abstract
We show several constructions based on the AES round function that can be used as building blocks for MACs and authenticated encryption schemes. They are found by a search of the space of all secure constructions based on an efficient design strategy that has been shown to be one of the best among all those considered. We implement the constructions on the latest Intel processors. Our benchmarks show that on Intel Skylake the smallest construction runs at 0.188 c/B, while the fastest runs at only 0.125 c/B, i.e. five times faster than AES-128.
Keywords
Fast software implementation, AES, AES-NI, Skylake
1 Introduction
As a block cipher standard, the AES has inspired many cryptographic designs. Stream and block ciphers, authenticated encryption schemes (AEs), cryptographic hash functions and Message Authentication Codes (MACs) based on the AES benefit from its two main features, namely, its security and efficiency. The security benefit is twofold. First, as the AES is the most popular block cipher, it has been extensively analyzed and its security is well understood [9, 14, 15]. Second, the AES is based on the so-called wide-trail strategy [6], which provides resistance against the standard differential and linear attacks. The efficiency benefit is significant as well. Due to its internal structure, the AES allows fast software implementations based on lookup tables as well as even more efficient bitsliced implementations [12]. Furthermore, the latest mainstream processors have a dedicated set of instructions, called AES-NI, that provides a complete implementation of the AES. With a few lines of code, these instructions execute one block cipher call with exceptionally high efficiency (measured in cycles per byte of data, or c/B). For instance, on the same architecture, the table-based implementation of AES-CTR runs at around 10 c/B and its bitsliced implementation at around 7.5 c/B, while its AES-NI implementation runs at less than 1 c/B. As significant speedups are observed when AES-NI is available, it is important to understand how far we can benefit from these instructions.
Depending on the security requirements and adversarial model, designs based on the AES may use a round-reduced version of the block cipher. For instance, Pelican-MAC [8], Alpha-MAC [7], LEX [1], ASC-1 [11], and ALE [3] use only four rounds of the AES to process one message block (compared to the ten rounds in the original AES-128 block cipher). Obviously, the reduction in the number of rounds has a direct impact on the efficiency, and these designs run at much higher speed. The decision to reduce the number of rounds to four stems from the wide-trail strategy, since in some cases four rounds already provide a sufficient level of security. Only a few designs use fewer than four rounds, as the security analysis becomes more intricate.
Our Contributions. We examine AES-based constructions that can be used as building blocks of secret-key primitives (e.g., MACs and authenticated encryption schemes). Our main goal is to push the limits of efficiency of constructions that can be implemented with AES-NI, without sacrificing their security.
As reference points and benchmarks, we use the two authenticated encryption schemes AEGIS-128L and Tiaoxin-346 submitted to the CAESAR competition [5]. These schemes not only rely on round-reduced AES (to process a 16-byte message block, AEGIS-128L uses four rounds, while Tiaoxin-346 uses only three rounds of AES), but also allow a full parallelization of the round calls. As a result, with an AES-NI implementation they achieve exceptionally high efficiency and run at only 0.2–0.3 cycles per byte of message.
To understand the speed advantage of these designs, we first focus on AES-NI. We investigate the performance of the AES-NI instruction aesenc (which executes one round of AES) on the latest Intel processors and deduce necessary conditions for efficient designs. Consequently, our designs have internal states composed of several 128-bit words (called blocks), while their step functions are based only on aesenc and bitwise additions (XORs). The state size, the number of aesenc calls per step, and the choice of state words to which aesenc is applied ensure that our designs have a high efficiency.
Next, we focus on the security of the designs. The most common attacks on MACs and AE schemes are internal collisions based on high-probability differential characteristics that start and end in zero state differences (but where some intermediate states contain differences, introduced through the messages). The inability of the adversary to efficiently build such collisions is the single security criterion required from our designs.
We consider two strategies that may lead to efficient and secure constructions. In the first, the AES rounds are applied to the words of the state in such a way that several steps of the construction mimic a few keyless AES rounds^{1}. Due to the wide-trail approach, this strategy provides easier security proofs. However, we show that regardless of the step function chosen, such a strategy has only limited efficiency potential. For instance, a strategy based on 4-round AES can never run faster than 0.25 cycles per byte.
To achieve higher speed, we thus consider a second strategy, where message and state words can be XORed between the AES calls. The wide-trail approach can no longer be used (as each application is a one-round AES), hence the security proof for the constructions becomes much harder. To obtain it, for each candidate construction we transform the collision problem into a MILP problem, and find the optimal solution, which corresponds to the characteristic with the highest probability. The cases where this probability is too low correspond to secure constructions.
We search for suitable designs based on the second strategy by gradually increasing the state size and decreasing the number of AES rounds per step. In some cases, several constructions have the same efficiency but provide different security margins. We implement each construction on the latest Intel processors and check whether the theoretical and actual cycles-per-byte counts match. We list 7 secure constructions that provide a good tradeoff between state size and efficiency. The smallest has 6 words, and runs at 0.22 c/B on Haswell and 0.188 c/B on Skylake. The most efficient has 12 words, and runs at 0.136 c/B on Haswell and 0.125 c/B on Skylake. This construction uses only 2 AES rounds per block of message, and thus it is five times faster than the AES.
2 Designs Based on the AES Round Function
2.1 The AES Round Function and the Instruction Set AES-NI
AES is the current block cipher standard and a well-studied cryptographic construction. As such, parts of AES are used in many crypto designs. The usage ranges from the utilization of the AES S-box in some hash functions, to the application of the AES round function in stream ciphers, and the employment of the whole AES in particular authenticated encryption schemes. The AES contains three different block ciphers, which differ only in their key sizes: in the remainder of this paper, we simply write AES to refer to the 128-bit key version AES-128.
Table 1. The latency and throughput of aesenc on the latest Intel processors.
Processor      Latency (cycles)  Throughput
Sandy Bridge   8                 1
Ivy Bridge     8                 1
Haswell        7                 1
Broadwell      7                 1
Skylake        4                 1
Our design strategies target the five latest Intel processors: Sandy Bridge and Ivy Bridge (collectively referred to as *bridge), Haswell and Broadwell (referred to as *well), and Skylake.
2.2 Efficiency
Our goal is to devise a strategy that results in designs based on aesenc that are more efficient than the AES. Improvements in efficiency can come from two concrete approaches: reducing the number of rounds per message block, and parallelizing the aesenc calls. Let us take a closer look at the two approaches.
Reducing the Number of Rounds. The AES has 10 rounds^{3}, i.e. it uses 10 aesenc calls^{4} to process a 16-byte message. Removing several rounds from the AES leads to a block cipher susceptible to practical attacks. This, however, does not imply that any design (not only a block cipher) should necessarily use around 10 aesenc calls. In fact, a common approach based on the AES is to design cryptographic primitives that use only four AES rounds to process 16 bytes of data.
The goal of our design is to use a minimal number of calls to aesenc. For this purpose, we define a metric, called the rate of a design:
Definition 1 (Rate). The rate \(\rho \) of a design is the number of AES rounds (calls to aesenc) used to process a 16-byte message.
For instance, AES-128 has a rate of 10, AES-256 has a rate of 14, AEGIS-128L has a rate of 4, and Tiaoxin-346 a rate of 3. Obviously, a smaller rate may lead to more efficient designs.^{5}
Parallelizing the Round Calls. A large improvement in efficiency may come by switching from serial^{6} to parallel calls to aesenc.
Designs with parallel calls to aesenc can be far more efficient, as the instructions are executed simultaneously, i.e. the following aesenc can be called while one or more of the previous aesenc calls are still executing. The cycle count now depends not only on the number of rounds and the latency, but also on the throughput and the maximal number of independent instances of aesenc supported by the design. A textbook example of a parallelizable construction is the counter (CTR) mode.^{8} On Haswell it is possible to process 7 message blocks in parallel (see Fig. 2): at cycle 0, aesenc is called to perform the first AES round for the first message block (it returns the result at cycle 7); at cycle 1, aesenc for the first AES round of the second message block is called, and so on, until at cycle 6 the aesenc for the first round of the seventh message block is called. Then, the aesenc calls that perform the second rounds for all seven message blocks are issued at cycles 7–13. By repeating this procedure, it is possible to perform all ten AES rounds for all 7 message blocks: the last rounds are issued at cycles 63–69, and the ciphertexts are produced at cycles 70–76. Hence only 76 cycles, which can be brought down to 70 if longer messages are considered, are required to process 7 message blocks, or on average only 10 cycles per message block (compared to 70 cycles for processing a message block in the serial CBC mode). Therefore, the CTR mode runs at \(10/16=0.625\) c/B, or precisely 7 times faster than the CBC mode.
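The cycle counts in the CTR example can be reproduced with a small scheduling model. The sketch below is only an illustration under simplifying assumptions: a single aesenc port, one instruction issued per `issue_interval` cycles, and a fixed latency. The function name `last_result_cycle` and its parameters are hypothetical and do not come from the paper.

```python
def last_result_cycle(blocks, rounds, latency, issue_interval=1):
    """Cycle at which the final aesenc result is available.

    Round r of a block can only be issued once round r-1 of the same
    block has completed; rounds of different blocks are independent.
    """
    ready = [0] * blocks   # earliest cycle at which each block's next round may issue
    port_free = 0          # earliest cycle at which the aesenc port accepts a call
    finish = 0
    for _ in range(rounds):
        for b in range(blocks):
            issue = max(port_free, ready[b])
            ready[b] = issue + latency
            port_free = issue + issue_interval
            finish = max(finish, ready[b])
    return finish


# Haswell-like parameters (latency 7): seven interleaved blocks vs. one serial block.
parallel_cycles = last_result_cycle(blocks=7, rounds=10, latency=7)
serial_cycles = last_result_cycle(blocks=1, rounds=10, latency=7)
```

Under these assumptions the model gives 76 cycles for seven interleaved blocks and 70 cycles for a single serial block, matching the counts in the text.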
The State Size and the Number of aesenc Calls per Step. The parallel calls to aesenc can be achieved only if the state size is sufficiently large. We have seen that CBC mode requires a state composed of only one 16-byte word, but provides no parallelization. On the other hand, if supplied with a state of seven words, the CTR mode can run seven instances in parallel. As we strive for designs with high efficiency and thus support for parallel calls to aesenc, they will have larger states. In general, if the design makes c calls to aesenc per step, then the state has to have at least c 128-bit words: only in this case can we have fully parallelizable aesenc calls.
The optimal number of aesenc calls per step depends on the latency to throughput ratio. The most efficient designs use around latency/throughput independent calls to aesenc per step. Let us illustrate this fact with the example of a hypothetical design that makes four aesenc calls per step to process a 16-byte message (i.e. has a rate of \(4/1=4\)) and is implemented on Haswell, which in turn has a ratio of \(7/1=7\). The four aesenc calls of the first step are issued at cycles 0, 1, 2, and 3 (at every cycle because the throughput is 1), but the results of these calls are obtained only at cycles 7, 8, 9, 10 (because the latency is 7). As a result, at cycles 4, 5, and 6, no aesenc calls are made,^{9} and we say that the aesenc port^{10} has not been saturated, i.e. there have been empty cycles. Due to the empty cycles, even though the rate is 4, one needs 7 cycles on Haswell to process the message block, thus the speed is \(7/16=0.4375\) c/B. The cycle count changes when the same design is implemented on Skylake (with ratio \(4/1=4\)). On this processor, the aesenc port is fully saturated, and on average it requires only 4 cycles per 16-byte message,^{11} which means that this design would run at \(4/16=0.25\) c/B.
A construction with rate \(\rho \) cannot run faster than \(0.0625 \rho \) c/B because, by definition, it needs \(\rho \) aesenc calls (in total at least \(\rho \) cycles) to process a 16-byte message, hence the best possible speed is \(\frac{\rho }{16} = 0.0625\rho \) c/B. On the other hand, if the number of aesenc calls per step is smaller than the latency to throughput ratio, then, for the aforementioned reasons, the aesenc port may not be saturated, and the speed may drop to \(0.0625 \frac{latency}{throughput}\) c/B. In the sequel, we take this number as our expected speed. The actual speed, however, may differ. It could be slower if the aesenc calls of different steps are dependent, i.e. if the inputs to the aesenc calls of the next step depend on the outputs of the aesenc calls of the previous step. On the other hand, the actual speed could be faster than expected if more than \(\frac{latency}{throughput}\) aesenc calls could run at the same time; this happens when some of the aesenc calls of the next step can start before most of the aesenc calls of the previous step have finished.
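The expected-speed reasoning above can be written as a short calculation. This is a minimal sketch, assuming that one step costs max(number of aesenc calls, latency) cycles when one aesenc is issued per cycle, and that a step processes m message blocks of 16 bytes; the helper names are hypothetical.

```python
def rate(a, m):
    """Rate of a design: AES rounds (aesenc calls) per 16-byte message block."""
    return a / m


def expected_speed(a, m, latency, issue_interval=1):
    """Expected cycles per byte for a calls per step and m blocks per step.

    Assumption: the a calls of one step are independent, so a step takes
    a * issue_interval cycles when the port is saturated, and `latency`
    cycles otherwise.
    """
    cycles_per_step = max(a * issue_interval, latency)
    return cycles_per_step / (16 * m)


# Hypothetical design from the text: four aesenc calls per step, one block per step.
haswell = expected_speed(a=4, m=1, latency=7)   # port not saturated
skylake = expected_speed(a=4, m=1, latency=4)   # port saturated
```

With the latencies of Table 1 this gives 0.4375 c/B on Haswell and 0.25 c/B on Skylake for the hypothetical rate-4 design, in line with the worked example.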

To summarize, efficient designs based on aesenc follow these guidelines:
(i) a lower rate (number of aesenc calls per message block) leads to more efficient designs,
(ii) all aesenc calls per step are independent and thus run in parallel,
(iii) the state is at least as large as the number of aesenc calls per step,
(iv) the number of aesenc calls per step is close to the latency/throughput ratio.
2.3 Security Notions
We suggest design strategies to construct building blocks for symmetric-key primitives, and thus we adapt the security requirements accordingly. The constructions proposed below could, for instance, be used to build a MAC algorithm, where an initialization phase first randomizes a 128-bit key and IV-dependent internal state, and a 128-bit tag is then produced by injecting message blocks. In such a case, classical security requirements impose that no key-recovery or forgery attack succeeds in less than \(2^{128}\) operations. Likewise, if an authenticated encryption scheme uses our building block with a 128-bit key to produce a 128-bit tag, then fewer than \(2^{128}\) computations must not suffice to break the scheme.
Analyzing the resistance of a design against all possible attacks is infeasible without giving the full specification.^{12} To capture this, we reduce the security claim of our constructions to the problem of finding internal collisions. Nonetheless, we emphasize that this is only one of the requirements of a cryptographic primitive, thus the resistance against the remaining attacks should be checked after completing the whole design.
The reason we use state collisions as our unique security requirement is twofold. First, we cannot fathom how designers will use our building blocks, and this notion applies directly to many different schemes, like hash functions, or MAC and AE schemes where a state collision would yield a forgery. Therefore, by focusing only on this notion, we maximize the security of future designs based on these building blocks. Second, the inherent algorithmic problem is well-studied and understood: it consists in finding special types of differential characteristics that start and end in zero difference. Finally, we can also argue how significant this requirement is by recalling that several primitives have been broken due to susceptibility to attacks based on state collisions (see for instance [13, 20]).
To find a state collision means to identify two different sequences of messages such that, from the same initial state value, the same output state value is reached in the scheme after injecting the different message sequences. Consequently, we can describe this problem as finding a high-probability differential characteristic from the all-zero state difference to the same all-zero state difference, where the differences come from the message bytes. By high-probability, we mean higher than \(2^{-128}\) since we focus on the AES, which relies on a 128-bit internal state.
Table 2. Minimum number of active S-boxes in the AES in the single-key model.
Rounds          1  2  3  4   5   6   7   8   9   10
Active S-boxes  1  5  9  25  26  30  34  50  51  55
Therefore, to construct secure designs based on the AES round function when no differences are introduced in the subkeys, it is sufficient to ensure that a difference enters four rounds of AES. Indeed, four rounds necessarily have at least 25 active S-boxes, which directly yields an upper bound on the probability of any differential characteristic: \(2^{-6\cdot 25}=2^{-150}\ll 2^{-128}\). This 4-round barrier explains why many previous designs chose to exploit this provable bound and gain in efficiency in comparison to the ten rounds used in the actual AES-128 block cipher.
In our case, we are interested in designs which achieve higher performance and do not necessarily rely on four rounds of AES. Consequently, the differential characteristic mentioned before, which starts and ends in no-difference states, must activate at least 22 S-boxes, so that its probability is at most \(2^{-6\cdot 22}=2^{-132}<2^{-128}\). Hence, in the sequel the security goal imposed on our designs is that their best differential characteristic has at least 22 active S-boxes.
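The threshold of 22 active S-boxes follows from simple arithmetic, which the sketch below spells out. It assumes, as in the bounds above, that each active AES S-box contributes a factor of at most \(2^{-6}\); the function names are illustrative only.

```python
SBOX_LOG2_MAX_DP = -6   # each active AES S-box contributes a factor of at most 2^-6


def log2_probability_bound(active_sboxes):
    """log2 of the upper bound on the probability of a characteristic."""
    return SBOX_LOG2_MAX_DP * active_sboxes


def min_active_sboxes(security_bits=128):
    """Smallest number of active S-boxes whose bound is strictly below 2^-security_bits."""
    n = 0
    while log2_probability_bound(n) >= -security_bits:
        n += 1
    return n


threshold = min_active_sboxes(128)
```

With 21 active S-boxes the bound is only \(2^{-126}\), which is not below \(2^{-128}\), so 22 is the smallest sufficient count.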
2.4 General Structure and Definitions
We define here the classes of AES-based designs that we study in the remainder of the paper. For all the aforementioned reasons, we focus on only two operations on 128-bit values: the AES round function denoted by A and performed by the aesenc instruction, and the XOR operation denoted by \(\oplus \).
We emphasize that all the designs belonging to these classes implement shifts of the state words so that the various applications of A are independent. Consequently, each updated word \(X^{i+1}_{t}\), for \(0\le t < s\), necessarily depends on \(X^{i}_{t-1 \pmod {s}}\), and optionally on \(X^{i}_{t}\). The main rationale behind this stems from the objective to reach high efficiency: should the diffusion be higher, for instance if a single output of A were XORed to every output word, the processor would have to wait until all the output words have their final value. In our case, the shifts allow us to optimize the usage of the processor cycles: evaluating the design from right to left, the first call to A is likely to have finished by the time we start processing the leftmost state word. Hence, iteration \(i+1\) can start without waiting for the end of iteration i.
However, this optimized scheduling of instructions comes at the expense of the diffusion: from a single bit difference in the input state, reaching a full diffusion might take several steps. As a complete opposite, reaching full diffusion in a single step would mean XORing the output of a single A to all the output state words, and would waste many cycles. While this seems to suggest an interesting tradeoff, we nevertheless show in the sequel that there do exist designs in the class \(\mathcal A_{\oplus }\) which, at the same time, achieve optimally high efficiency and meet our security requirements.
In terms of implementation, as mentioned before, the aesenc operation ends with the XOR of a round subkey, and as a result, the implementations may benefit from this free operation. Namely, if we need to XOR the message block M after the aesenc, we can just use the instruction \({\texttt {aesenc}}(\bullet , M)\). Otherwise, we can just use \({\texttt {aesenc}}(\bullet , 0)\).
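To make the free XOR concrete, the following is a minimal pure-Python model of what one aesenc call computes (SubBytes, ShiftRows, MixColumns, then the XOR of the second operand). It is a slow reference sketch for illustration, not the AES-NI implementation used for the benchmarks; the helper names are hypothetical, and the state is assumed to be 16 bytes in the usual AES column-major order.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox_entry(x):
    """AES S-box: field inversion (computed as x^254, so 0 maps to 0), then the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    out = 0x63
    for shift in range(5):  # inv XOR its left rotations by 1, 2, 3, 4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SBOX = [sbox_entry(x) for x in range(256)]


def aesenc(state, round_key):
    """One AES round on 16 bytes; byte i sits in row i % 4, column i // 4."""
    s = [SBOX[b] for b in state]                                       # SubBytes
    s = [s[(i % 4) + 4 * ((i // 4 + i % 4) % 4)] for i in range(16)]   # ShiftRows
    out = []
    for c in range(4):                                                 # MixColumns
        col = s[4 * c: 4 * c + 4]
        for r in range(4):
            out.append(gf_mul(2, col[r]) ^ gf_mul(3, col[(r + 1) % 4])
                       ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return bytes(x ^ k for x, k in zip(out, round_key))                # final XOR


def xor(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


ZERO = bytes(16)
state = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
message = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
# Injecting M through the second operand equals a keyless round followed by an XOR.
same = aesenc(state, message) == xor(aesenc(state, ZERO), message)
```

The model can be checked against the first round of the example in Appendix B of FIPS-197, whose start-of-round state is the `state` value used here.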
Notations. We use the following notations to describe the designs. We introduce the parameters s, which represents the number of 128-bit state words, a, the number of AES rounds in a single step, and m, the number of 128-bit message blocks processed per step. Additionally, we denote by \(\rho \) the rate of the design following Definition 1, that is \(\rho =a/m\).
3 The Class \(\mathcal A_{\oplus }^{r}\) and Rate Bounds
Designs from \(\mathcal A_{\oplus }^{r}\) are easier to analyze as they resemble r rounds of the AES. As a result, their main advantage lies in the possibility of using the wide-trail strategy of the AES, which dictates that the minimal number of active S-boxes of 2, 3, and 4 rounds of AES is 5, 9, and 25, respectively (see Table 2). For example, to prove that a particular \(\mathcal A_{\oplus }^{3}\) design is secure by our definition, we have to show that in any differential characteristic that starts and ends in a zero difference, a state difference must go at least three times through the cascaded three rounds of AES. Such a design would be secure, because the number of active S-boxes for any characteristic would be at least \(3 \cdot 9 = 27 \ge 22\). For the class \(\mathcal A_{\oplus }^{2}\) (resp. \(\mathcal A_{\oplus }^{4}\)), the similar requirement is to activate five times (resp. once) the cascaded 2-round (resp. 4-round) AES.
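The number of times a difference must pass through the cascaded r-round AES follows directly from the bounds of Table 2 and the target of 22 active S-boxes. A minimal sketch of this arithmetic, with hypothetical names:

```python
import math

# Minimum number of active S-boxes for r consecutive AES rounds (Table 2).
MIN_ACTIVE = {2: 5, 3: 9, 4: 25}


def passes_needed(r, target=22):
    """How many times a difference must traverse the cascaded r-round AES."""
    return math.ceil(target / MIN_ACTIVE[r])


requirements = {r: passes_needed(r) for r in (2, 3, 4)}
```

This yields five passes for r = 2, three for r = 3, and one for r = 4, as stated above.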
The efficiency of these designs, however, is limited. Further, we show that their rates cannot be arbitrarily low, but are in fact bounded from below by r.
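Theorem 1
The rate \(\rho \) of any secure design from \(\mathcal A_{\oplus }^{r}\) satisfies \(\rho \ge r\).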
Proof
Any design from \(\mathcal A_{\oplus }^{r}\) can be divided into several parts. Each r-step cascaded aesenc with the corresponding state words composes a so-called nonlinear part. Consecutive XORs of the message and the state words (with no aesenc in between) also compose a part, called a linear part. Note that there can be several nonlinear and linear parts. For instance, the design from Fig. 4 can be divided into two nonlinear parts (denoted with thick lines) and two linear parts (the remaining two parts between the nonlinear parts).
A design is insecure if we can build a high-probability differential characteristic that starts and ends in zero state difference (but where some intermediate state words have non-zero differences introduced through the message words). Further, we show that if the rate is too small, more precisely if \(\rho <r\), then we can build a differential characteristic with no active S-boxes. That is, the difference in the state can be introduced through the message words and then canceled in the following steps, without reaching the state words to which aesenc is applied. As a result, the probability of that differential characteristic would be one.
Remark 1
The rate bound holds for any design based on r-round cascaded AES (and not only for the class with shifts to the right that we analyze).
From the theorem, we can conclude that regardless of the actual construction, designs from \(\mathcal A_{\oplus }^{4},\mathcal A_{\oplus }^{3}\) and \(\mathcal A_{\oplus }^{2}\) cannot have rates lower than 4, 3, and 2, respectively, and thus cannot run faster than 0.250 c/B, 0.188 c/B, and 0.125 c/B, respectively.
Note that, as the step functions of AEGIS-128L and Tiaoxin-346 run at 0.250 c/B and 0.188 c/B (have rates 4 and 3), in order to find more efficient designs, we have to either find rate-3 designs with smaller states (at most 12 words, as Tiaoxin-346 has 13 words), or designs with lower rate. We have run a complete search of all designs from \(\mathcal A_{\oplus }^{3}\) with at most 12 state words and found that none of them is secure^{13}. Furthermore, we have run a partial search^{14} among designs from \(\mathcal A_{\oplus }^2\) and found constructions with rate 2.66, but not lower. Thus, to achieve more efficient designs, in the next section we examine the class \(\mathcal A_{\oplus }\).
4 Designs in the Class \(\mathcal A_{\oplus }\)
In this section, we focus on the more general class of designs \(\mathcal A_{\oplus }\), where the AES round function is not necessarily iterated. From a cryptanalytic standpoint, this means that the class encompasses designs where state differences can be introduced between two consecutive AES round functions. The main consequence in comparison to the previous class \(\mathcal A_{\oplus }^{r}\) from Sect. 3 is that we lose the simplicity of the analysis brought by the wide-trail strategy. One could compare the change of analysis to the transition from the single-key framework of the AES to its related-key counterpart (where differences may be introduced between consecutive rounds).
However, in spite of the more complex analysis, we show that there exist low-rate designs in this larger class that meet our security requirements. Namely, we show several designs that achieve rates 3, 2.5, and even 2.
The study of \(\mathcal A_{\oplus }\) is less straightforward than the previous case, thus we rely on mixed integer linear programming (MILP) to derive lower bounds on the number of active S-boxes of the designs. In the next sections, we briefly recall the MILP technique applied to cryptanalysis (Sect. 4.1) and we detail our results (Sect. 4.2).
4.1 MILP and Differential Characteristic Search
From a high-level perspective, a MILP problem aims at optimizing a linear objective function subject to linear equalities and/or linear inequalities. The technique we use in this paper is called mixed integer linear programming as it relaxes the all-integer constraint on the variables of classical integer linear programming. More precisely, in our case some variables might not be integers, but all the integer variables are 0–1 variables. Therefore, we could dub this particular setup 0–1 MILP.
The 0–1 MILP problems are usually NP-hard, but solutions can be found using different strategies, for instance, the cutting-plane method, which iteratively refines a valid solution by performing cuts relying on the linear inequality constraints of the problem. For our purposes, we use one of the many solvers existing to date, namely the Gurobi solver [10]. Several published results rely on MILP optimization tools to solve cryptanalytic problems: searches for differential characteristics in various schemes are given in [18], known lower bounds for the number of active S-boxes for the related-key setting of AES in [16], analysis of reduced versions of the Trivium stream cipher in [4], etc.
We aim at finding differential characteristics from the all-zero difference input state to the same all-zero output state after a variable number of steps. As mentioned before, our measure of security relies on the number of active S-boxes, which gives an upper bound on the success probability of a differential attack that may lead to state collisions. We transform the search for differential characteristics into MILP problems whose objective functions count (and minimize) the number of active S-boxes. In practice, since we use the AES round function, we only require the differential characteristics to have at least 22 active S-boxes to ensure security.
For a given state size of s 128-bit words, to express the problem of finding a differential characteristic, we examine the effect of the four elementary transformations of the AES round function. We emphasize that the analysis is performed in terms of truncated differences (\(x\in \{0,1\}\)) since we are only concerned with active or inactive S-boxes: the actual differences are insignificant. Therefore, as soon as one S-box is active, the SubBytes operation maintains this property. Hence, SubBytes does not introduce any linear constraints in the MILP problem. The same holds for the ShiftRows operation, which only permutes the bytes of the internal state.
In summary, for a single round of AES, we introduce \(4\times 9+16\times 4=100\) inequalities to express the round constraints. On top of that, we introduce \(16\times 4=64\) additional inequalities for every extra XOR required to inject the message blocks. Finally, we also need to add \(2\times s\times 16\) equality constraints to represent the required zero difference in the input state and in the output state to reach a state collision. To give concrete numbers, we point out that systems corresponding to our smaller designs would need around 10,000 binary variables and 20,000 to 30,000 linear constraints.
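The constraint counts above can be reproduced by generating the inequalities symbolically. The sketch below assumes the usual truncated-difference modelling, which is consistent with the counts \(4\times 9\) and \(16\times 4\): four inequalities per byte-wise XOR with one dummy variable, and nine inequalities per MixColumns column with branch number 5. The exact inequalities, the variable names, and the assumption that each step contributes a rounds and x extra XORs are illustrative and not taken from the paper.

```python
def xor_inequalities(a, b, c, d):
    """Truncated XOR c = a + b on one byte, with dummy variable d (4 inequalities)."""
    return [f"{a} + {b} + {c} >= 2 {d}", f"{d} >= {a}", f"{d} >= {b}", f"{d} >= {c}"]


def mixcolumns_inequalities(inputs, outputs, d):
    """One MixColumns column with branch number 5 and dummy variable d (9 inequalities)."""
    terms = inputs + outputs
    return [" + ".join(terms) + f" >= 5 {d}"] + [f"{d} >= {v}" for v in terms]


def round_inequalities(tag):
    """All inequalities for one AES round: 4 columns, then 16 byte-wise XORs."""
    cons = []
    for col in range(4):
        ins = [f"x_{tag}_{4 * col + r}" for r in range(4)]    # bytes after ShiftRows
        outs = [f"y_{tag}_{4 * col + r}" for r in range(4)]   # bytes after MixColumns
        cons += mixcolumns_inequalities(ins, outs, f"dm_{tag}_{col}")
    for i in range(16):                                        # XOR of the second operand
        cons += xor_inequalities(f"y_{tag}_{i}", f"k_{tag}_{i}",
                                 f"z_{tag}_{i}", f"dx_{tag}_{i}")
    return cons


def system_size(s, a, x, steps):
    """(inequalities, equalities) for `steps` steps of a design with s state words,
    a AES rounds and x extra XORs per step."""
    inequalities = steps * (100 * a + 64 * x)
    equalities = 2 * s * 16
    return inequalities, equalities


per_round = len(round_inequalities("r0"))
```

Such a list of strings could then be written out in a solver's input format; minimizing the sum of the S-box input variables over the resulting system gives the lower bound on the number of active S-boxes.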
Limitations. Despite providing a simple and efficient way of finding differential characteristics, MILP only yields upper bounds on the actual probabilities of the differential characteristics as, theoretically, they can be impossible. We emphasize that this does not relate to impossible differential characteristic, but to the fact that partially undetermined behavior of the XOR operation (mentioned before) may result in inconsistent systems that produce truncated differential characteristics which are impossible to instantiate with actual differences. Fortunately, while a cryptanalyst should ensure the validity of the produced characteristics, we, as designers, only need to confirm that the upper bound on the probability of the best differential characteristic is sufficiently low.
4.2 Results of the Search
In this section, we conduct the search for efficient designs and describe the results produced by the MILP analysis. In Sect. 5, we give the actual implementations and benchmarks of the produced designs.
5 Implementation Results
We benchmark the seven constructions on the latest Intel processors. The aesenc instruction has similar performance on some of these processors (see Table 1), thus we benchmark on only three different platforms: Ivy Bridge (i5-3470) with Linux kernel 3.11.0-12 and gcc 4.8.1, Haswell (i5-4570) with Linux kernel 3.11.0-12 and gcc 4.8.1, and Skylake (i5-6200U) with Linux kernel 3.16.0-38 and gcc 4.8.4. We wrote the implementations in C and optimized them separately for each processor. The benchmarks were produced with Turbo Boost disabled and for 64 kB messages^{16}.
Benchmarks (in c/B) of designs based on the AES round function. s: number of 128-bit state words, a: number of AES rounds in a single step, m: number of 128-bit message blocks processed per step, x: number of additional XORs per step, \(\rho \): rate of the design (a / m), LB: lower bound on the number of active S-boxes. Marked numbers mean that the aesenc port is saturated for the given processor. Numbers in parentheses are projections; no actual measurements have been performed. Numbers in bold denote that the practical and theoretical speeds match (less than 5 % difference), while numbers with \(+\) (resp. −) denote that the practical speed is higher (resp. lower) than the theoretical one.
From the table, we can see that in most of the cases, our benchmarks follow the expected speed. For Ivy Bridge, the exceptions are the rate-3 design, which runs at 0.222 c/B instead of the expected 0.189 c/B (17 % slower), and the rate-2 design, which runs at 0.190 c/B instead of 0.167 c/B (13 % slower). For Haswell, three designs run faster than expected, with gains of 15 %, 24 %, and 22 %, respectively. On Skylake, the measured speed matches the expected speed for all seven constructions.
Among the seven constructions, we would like to single out the last construction, which has a rate of 2, i.e. it uses two AES rounds to process a 16-byte message. On all three tested processors, this construction is exceptionally efficient. In addition, on Skylake, we were able to match the theoretical speed (our measured speed was 0.126 c/B against the theoretical 0.125 c/B). Hence, designs based on this construction may run five times faster than AES-128.
We note that on platforms without AES-NI support our designs cannot reach the target speed. However, they are by no means slow, as they use only 2–3 AES rounds to process a 16-byte message block. Hence, the expected speed on these platforms is still much higher than the speed of AES, e.g. we expect that our constructions will run around 3–5 times faster than AES-128 in counter mode.
In addition, the state sizes of the constructions are large, hence they are not suitable for lightweight applications. However, we note that all seven constructions have sizes which are smaller than the state of SHA-3, which has 25 64-bit state words (equivalent to 12.5 128-bit blocks).
6 Conclusion
We have presented new building blocks for secret-key primitives based on the AES round function. By targeting the most recent Intel processors from the past four years, we have relied on the dedicated instruction set AES-NI to construct highly efficient designs. The designs are finely tuned for these processors to take advantage of the available parallelism and to reach optimal speed. They are based on the second, more efficient design strategy, which requires a more complex security proof (reduction to MILP), but allows higher efficiency.
We have provided seven different building blocks that follow our design strategies and that reach high speed on the latest processors. On Ivy Bridge they run at 0.190–0.250 c/B, on Haswell at 0.136–0.219 c/B, and on Skylake at 0.125–0.188 c/B. We emphasize that our fastest construction uses only two AES rounds to process a 16-byte message block and runs at only 0.125 c/B on Skylake. To the best of our knowledge, this construction is much faster than any known cryptographic primitive.
Follow-up works that introduce better designs may start from two related directions: either reducing the state size, or increasing the number of message blocks processed in each step of the designs. The former might be useful to improve designs that require too many registers and thus slow down the whole process. The latter would automatically reduce the rate of the design and directly affect the measured speed. This direction is, however, difficult to tackle, as the adversary has a lot more freedom to construct high-probability characteristics.
Footnotes
 1.
This approach was chosen in Tiaoxin-346, where 2-round AES is used.
 2.
In addition to the encryption and decryption rounds, AES-NI also includes instructions that perform subkey generation and inverse MixColumns. Note that the four individual round operations can be realized as a composition of different AES-NI instructions. However, such a composition would be far less efficient than the round calls.
 3.
Here, we simply use AES to refer to AES-128.
 4.
The last round in AES is different and is executed with a call to the AES-NI instruction aesenclast, which has similar performance to aesenc.
 5.
A smaller rate is not a sufficient condition for efficiency, as parallelizing the aesenc calls plays an important role as well (see the next paragraph).
 6.
Bogdanov et al. [2] have analyzed the speed improvements of serial modes when processing multiple messages in parallel.
 7.
Recall that AES-CBC is defined as \(C_{i+1} = {\texttt {AES}}_K(C_i \oplus M_{i+1})\).
 8.
Recall that AES-CTR is defined as \(C_{i} = {\texttt {AES}}_K(N \Vert i) \oplus M_{i}\), where N is a nonce and \(\Vert\) denotes concatenation.
 9.
Assuming that all the calls to aesenc of the next round depend on some of the outputs of the previous four aesenc calls.
 10.
The part of the processor that executes aesenc.
 11.
If the aesenc calls are sufficiently independent between steps.
 12.
For instance, the initialization and finalization stages of the constructed stream cipher or authenticated encryption scheme.
 13.
This gives rise to the conjecture that the inequality from the theorem is strict.
 14.
In this case, the search space cannot be exhausted as it is too large.
 15.
The design uses only 3 aesenc calls per round, whereas the smallest latency among all the processors is 4.
 16.
Only a slight degradation of speed is observed when the message length is a few kilobytes.
References
 1. Biryukov, A.: The design of a stream cipher LEX. In: Biham, E., Youssef, A.M. (eds.) SAC 2006. LNCS, vol. 4356, pp. 67–75. Springer, Heidelberg (2007)
 2. Bogdanov, A., Lauridsen, M.M., Tischhauser, E.: Comb to pipeline: fast software encryption revisited. In: Leander, G. (ed.) FSE 2015. LNCS, vol. 9054, pp. 150–171. Springer, Heidelberg (2015)
 3. Bogdanov, A., Mendel, F., Regazzoni, F., Rijmen, V., Tischhauser, E.: ALE: AES-based lightweight authenticated encryption. In: Moriai, S. (ed.) FSE 2013. LNCS, vol. 8424, pp. 447–466. Springer, Heidelberg (2014)
 4. Borghoff, J., Knudsen, L.R., Stolpe, M.: Bivium as a mixed-integer linear programming problem. In: Parker, M.G. (ed.) Cryptography and Coding 2009. LNCS, vol. 5921, pp. 133–152. Springer, Heidelberg (2009)
 5. CAESAR. Competition for Authenticated Encryption: Security, Applicability, and Robustness. http://competitions.cr.yp.to/caesar.html
 6. Daemen, J., Rijmen, V.: The Design of Rijndael: AES - The Advanced Encryption Standard. Springer, Heidelberg (2002)
 7. Daemen, J., Rijmen, V.: A new MAC construction ALRED and a specific instance ALPHA-MAC. In: Gilbert, H., Handschuh, H. (eds.) FSE 2005. LNCS, vol. 3557, pp. 1–17. Springer, Heidelberg (2005)
 8. Daemen, J., Rijmen, V.: The MAC function Pelican 2.0. Cryptology ePrint Archive, report 2005/088 (2005)
 9. Derbez, P., Fouque, P.-A., Jean, J.: Improved key recovery attacks on reduced-round AES in the single-key setting. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp. 371–387. Springer, Heidelberg (2013)
 10. Gurobi Optimization, Inc.: Gurobi Optimizer Reference Manual (2015)
 11. Jakimoski, G., Khajuria, S.: ASC-1: an authenticated encryption stream cipher. In: Miri, A., Vaudenay, S. (eds.) SAC 2011. LNCS, vol. 7118, pp. 356–372. Springer, Heidelberg (2012)
 12. Käsper, E., Schwabe, P.: Faster and timing-attack resistant AES-GCM. In: Clavier, C., Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 1–17. Springer, Heidelberg (2009)
 13. Khovratovich, D., Rechberger, C.: The LOCAL attack: cryptanalysis of the authenticated encryption scheme ALE. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282, pp. 174–184. Springer, Heidelberg (2014)
 14. Li, L., Jia, K., Wang, X.: Improved single-key attacks on 9-round AES-192/256. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 127–146. Springer, Heidelberg (2015)
 15. Mala, H., Dakhilalian, M., Rijmen, V., Modarres-Hashemi, M.: Improved impossible differential cryptanalysis of 7-round AES-128. In: Gong, G., Gupta, K.C. (eds.) INDOCRYPT 2010. LNCS, vol. 6498, pp. 282–291. Springer, Heidelberg (2010)
 16. Mouha, N., Wang, Q., Gu, D., Preneel, B.: Differential and linear cryptanalysis using mixed-integer linear programming. In: Wu, C.-K., Yung, M., Lin, D. (eds.) Inscrypt 2011. LNCS, vol. 7537, pp. 57–76. Springer, Heidelberg (2012)
 17. Nikolić, I.: Tiaoxin-346. Submission to the CAESAR Competition (2014)
 18. Sun, S., Hu, L., Wang, P., Qiao, K., Ma, X., Song, L.: Automatic security evaluation and (related-key) differential characteristic search: application to SIMON, PRESENT, LBlock, DES(L) and other bit-oriented block ciphers. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 158–178. Springer, Heidelberg (2014)
 19. Wu, H., Preneel, B.: AEGIS: a fast authenticated encryption algorithm. Cryptology ePrint Archive, report 2013/695 (2013)
 20. Wu, S., Wu, H., Huang, T., Wang, M., Wu, W.: Leaked-state-forgery attack against the authenticated encryption algorithm ALE. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 377–404. Springer, Heidelberg (2013)