Abstract
Stream ciphers are complex state machines that generate an infinite stream of pseudo-random bits starting from a single key. These bits can be used as a keystream in encryption and decryption operations. In this chapter we’ll discuss the implementation of such a stream cipher algorithm, called Trivium, as a co-processor. The co-processor is attached to a host processor. The software on that host processor initializes the Trivium coprocessor, and retrieves a very long (infinite) keystream. We consider different types of host processors, including an 8-bit 8051 microcontroller, a 32-bit StrongARM RISC, and a 32-bit Microblaze processor. We will consider the impact of different types of hardware–software interfaces on the performance of the overall design. We will also investigate the path to implementation on an FPGA.
1 The Trivium Stream Cipher Algorithm
The Trivium stream cipher algorithm was proposed by Christophe De Canniere and Bart Preneel in 2006 in the context of the eSTREAM project, a European effort that ran from 2004 to 2008 to develop new stream ciphers. In September 2008, Trivium was selected as part of the official eSTREAM portfolio, together with six other stream cipher algorithms. The algorithm is remarkably simple, yet to this date it remains unbroken in practice. We will clarify below what it means to break a stream cipher. In this section, we discuss the concept of a stream cipher and the details of the Trivium algorithm.
1.1 Stream Ciphers
Let us first clarify how a stream cipher works and how it is different from a block cipher such as AES (which was discussed in Chap. 10). The left of Fig. 11.1 illustrates the difference between a stream cipher and a block cipher. A stream cipher is a state machine with an internal state register of n bits. The stream cipher kernel will initialize the state register based on a key, and it will update the state register while producing the keystream.
In contrast to a stream cipher, a block cipher is a stateless function that combines an m-bit key with a block of n bits of plaintext. Because there is no state, the encryption of one block of plaintext bits is independent of the encryption of the previous block of bits. Of course, many hardware implementations of block ciphers do include registers. However, these registers are an effect of sequentializing the block cipher algorithm; it is perfectly feasible to implement block ciphers without any register.
The cryptographic properties of the stream cipher are based on the highly nonlinear functions used for state register initialization and state register update. These nonlinear functions ensure that the keystream cannot be predicted even after a very large number of keystream bits has been observed. Breaking a stream cipher means that one has found a way to predict the output of the stream cipher, or even better, one has found a way to reveal the contents of the state register. For a state register of n bits, the stream cipher can be in \(2^{n}\) possible states, so the total length of the keystream can be no more than \(n \cdot 2^{n}\) bits. Practical stream ciphers have an n between 80 and several hundred.
A stream cipher by itself does not produce ciphertext, but only a stream of keybits. The right of Fig. 11.1 illustrates how one can perform encryption and decryption with a stream cipher. The keystream is combined (xor-ed) with a stream of plaintext bits to obtain a stream of ciphertext bits. Using an identical stream cipher that produces the same keystream, the stream of ciphertext bits can be converted back to plaintext using a second xor operation.
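The xor-based encryption and decryption described above can be sketched in a few lines of C. This is only an illustration: the function name `xor_keystream` is hypothetical, and the keystream bytes are assumed to be supplied by the caller from some stream cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR a buffer with a keystream of the same length.
 * The same routine encrypts and decrypts, because (p ^ k) ^ k == p.
 * The keystream is assumed to come from a stream cipher; here it is
 * simply passed in by the caller. */
void xor_keystream(uint8_t *dst, const uint8_t *src,
                   const uint8_t *keystream, size_t len)
{
    assert(dst != NULL && src != NULL && keystream != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ keystream[i];
}
```

Calling the function twice with the same keystream, first on the plaintext and then on the resulting ciphertext, returns the original plaintext. This is why sender and receiver need identical stream ciphers that produce the same keystream.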
A stream cipher algorithm produces, conceptually, a stream of bits. When the message to encrypt is not formatted as a stream of bits, but as a stream of bigger entities, such as words, the stream cipher will need to produce a stream of words instead. On a RISC processor, for example, it makes sense to represent a stream as a sequence of 32-bit words. Therefore, depending on the computer architecture, we would have a key-stream formatted as single bits, as bytes, as 32-bit words, and so on. One way to obtain a wider keystream is to run the stream cipher kernel at high speed and perform a serial-to-parallel conversion of the output. An alternative is illustrated in Fig. 11.2: the stream cipher can be easily parallelized to produce multiple keystream bits per clock cycle. This is especially useful when the stream cipher kernel is a very simple function, as is the case with Trivium.
1.2 Trivium
Trivium is a stream cipher with a state register of 288 bits. The state register is initialized based on an 80-bit key and an 80-bit initial value (IV). After initialization, Trivium produces a stream of keybits. The specification of Trivium is shown in Listing 11.1. In each iteration of the loop, a single output bit z is generated, and the state register s is updated. The addition and multiplication (+ and .) are taken over GF(2). They can be implemented with exclusive-or and bitwise-and, respectively. The double-bar operation (||) denotes concatenation.
The initialization of the state register proceeds as follows. The 80-bit key K and the 80-bit initial value IV are loaded into the state register, and the state register is updated \(4 \times 288 = 1152\) times without producing keybits. After that, the state register is ready to produce keystream bits. This is illustrated in the pseudocode of Listing 11.2.
These listings confirm that, from a computational perspective, Trivium is a very simple algorithm. A single state register update requires nine single-bit xor operations and three single-bit and operations. We need two additional single-bit xor operations to produce the output bit z.
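As a software cross-check of the operation counts above, the following is a minimal bit-level sketch of Trivium in C. It follows the update equations of the published Trivium specification and is not a copy of Listings 11.1 or 11.2. The type and function names (`trivium_t`, `trivium_step`, `trivium_init`) are hypothetical, and storing one bit per array element, with `key[i]` holding key bit i+1, is an assumption made for readability rather than speed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One bit per array element; s[1]..s[288] mirror the specification's
 * s1..s288 (index 0 is unused). Simple rather than fast. */
typedef struct { uint8_t s[289]; } trivium_t;

/* One state update; returns the keystream bit z for this step. */
static uint8_t trivium_step(trivium_t *t)
{
    uint8_t *s = t->s;
    uint8_t t1 = s[66]  ^ s[93];
    uint8_t t2 = s[162] ^ s[177];
    uint8_t t3 = s[243] ^ s[288];
    uint8_t z  = t1 ^ t2 ^ t3;          /* two xors for the output bit */

    t1 ^= (s[91]  & s[92])  ^ s[171];   /* nine xors and three ands   */
    t2 ^= (s[175] & s[176]) ^ s[264];   /* in total for t1, t2, t3    */
    t3 ^= (s[286] & s[287]) ^ s[69];

    /* Shift each of the three registers by one position. */
    memmove(&s[179], &s[178], 110);     /* s179..s288 <- s178..s287 */
    memmove(&s[95],  &s[94],  83);      /* s95..s177  <- s94..s176  */
    memmove(&s[2],   &s[1],   92);      /* s2..s93    <- s1..s92    */
    s[1]   = t3;
    s[94]  = t1;
    s[178] = t2;
    return z;
}

/* key[i] and iv[i] hold bits K(i+1) and IV(i+1), one bit per element. */
void trivium_init(trivium_t *t, const uint8_t key[80], const uint8_t iv[80])
{
    assert(t != NULL && key != NULL && iv != NULL);
    memset(t->s, 0, sizeof t->s);
    for (int i = 0; i < 80; i++) {
        t->s[1 + i]  = key[i] & 1;
        t->s[94 + i] = iv[i] & 1;
    }
    t->s[286] = t->s[287] = t->s[288] = 1;
    for (int i = 0; i < 4 * 288; i++)
        (void)trivium_step(t);          /* output discarded during init */
}
```

After `trivium_init`, each call to `trivium_step` yields one keystream bit. The sketch makes no claim about matching a particular test vector, because that depends on how key and IV bytes are mapped onto bit positions, which this chapter's listings define for the hardware.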
1.3 Hardware Mapping of Trivium
A straightforward hardware mapping of the Trivium algorithm requires 288 registers, 11 XOR gates, and 3 AND gates. Clearly, the largest cost of this algorithm is in the storage. Figure 11.3 shows how Trivium is partitioned into hardware modules.
- The trivium module calculates the next state. We will use the term Trivium kernel to indicate the loop body of Listing 11.1, without the state register update.
- The keyschedule module manages state register initialization and update. The keyschedule module has a single control input ld to initiate the state register initialization process. In addition, keyschedule has a single status bit e that indicates when the initialization has completed, and thus when the output keystream z is valid. This partitioning between keyschedule and trivium kernel was chosen with loop unrolling in mind (Fig. 11.2).
Based on this partitioning and the Trivium specification given earlier, it is straightforward to create a GEZEL description of Trivium. Listing 11.3 shows the implementation of a 1 bit per cycle Trivium. The control signals in the keyschedule module are generated based on a counter which is initialized after a pulse on the ld control input.
To create a bit-parallel keystream, we need to modify the code as follows. First, we need to instantiate the trivium module multiple times, and chain the state input and output ports together as shown in Fig. 11.2. Second, we need to adjust the key schedule, because the initialization phase will take less than four times 288 clock cycles. As an example, Listing 11.4 shows how to unroll Trivium eight times, thus obtaining a stream cipher that generates one byte of keystream per clock cycle. In this case, the initialization completes 8 times faster, after 143 clock cycles (line 33).
What is the limiting factor when unrolling Trivium? First, notice that unrolling the algorithm will not increase the critical path of the Trivium kernel operations as long as they affect different state register bits. Thus, as long as the state register bits that are read differ from the state register bits that are written, all the kernel operations are independent. Next, observe that a single Trivium round consists of three circular shift registers, as shown in Fig. 11.4. The length of each shift register is indicated inside the shaded boxes. To find how far we can unroll this structure, we look for the smallest feedback loop. This loop is located in the upper circular shift register, and spans 69 bits. Therefore, we can unroll Trivium at least 69 times before the critical path will increase beyond a single AND gate and two XOR gates. In practice, this means that Trivium can be easily adjusted to generate a keystream of double-words (64 bits). After that, the critical path will increase with every additional 69 bits. Thus, a 192 bit-parallel Trivium will be twice as slow as a 64 bit-parallel Trivium, and a 256 bit-parallel Trivium will be roughly three times as slow.
1.4 A Hardware Testbench for Trivium
For completeness, we also show a hardware testbench for the Trivium kernel in Listing 11.5. In this testbench, the key value is programmed to 0x80 and the IV to 0x0. After loading the key (lines 12–15), the testbench waits until the e-flag indicates the keystream is ready (lines 29–30). Next, each output byte is printed on the output (lines 19–22). The first 160 cycles of the simulation produce the following output.
> fdlsim trivium8.fdl 160
147 11001100 cc
148 11001110 ce
149 01110101 75
150 01111011 7b
151 10011001 99
152 10111101 bd
153 01111001 79
154 00100000 20
155 10011010 9a
156 00100011 23
157 01011010 5a
158 10001000 88
159 00010010 12
The keystream bytes produced by Trivium consist of the bytes 0xcc, 0xce, 0x75, 0x7b, 0x99, and so on. The bits in each byte are read left to right (from most significant to least significant). In the next sections, we will integrate this module as a coprocessor next to a processor.
2 Trivium for 8-bit Platforms
Our first coprocessor design will attach the Trivium stream cipher hardware to an 8-bit microcontroller. We will make use of an 8051 microcontroller. Like many other microcontrollers, it has several general-purpose digital input–output ports, which can be used to create hardware–software interfaces. Thus, we will be building a port-mapped control shell for the Trivium coprocessor. The 8051 microcontroller also has an external memory bus (XBUS), which supports a memory space of 64K. Such external memory busses are rather uncommon for microcontrollers. However, we will demonstrate the use of such a memory-bus in our design as well.
2.1 Overall Design of the 8051 Coprocessor
Figure 11.5 illustrates the overall design. The coprocessor is controlled through three 8-bit ports (P0, P1, P2). They are used to transfer operands, instructions, and to retrieve the coprocessor status, respectively. The Trivium hardware will dump the resulting keystream into a dual-port RAM module, and the contents of the keystream can be retrieved by the 8051 through the XBUS.
The system works as follows. First, the 8051 programs a key and an initialization vector into the Trivium coprocessor. Next, the 8051 commands the Trivium coprocessor to generate N keybytes, which will be stored in the shared RAM on the XBUS. Finally, the 8051 can retrieve the keybytes from the RAM. Note that the retrieval of the keybytes from RAM is only shown as an example; depending on the actual application, the keystream may be used for a different purpose. The essential part of this example is the control of the coprocessor from within the microcontroller.
To design the control shell, we will need to develop a command set for the Trivium coprocessor. As the 8-bit ports of the 8051 do not include strobes, we will use a handshake procedure similar to the one used earlier in Chap. 10: a simple idle instruction will help us to determine the exact clock cycle when a command becomes valid. The command set for the coprocessor is shown in Table 11.1. All of the commands except one complete within a single clock cycle. The last command, ins_enc, takes up to 256 clock cycles to complete. The status port of the 8051 is used to indicate when the encryption phase has completed. Figure 11.6 illustrates the command sequence for the generation of 10 bytes of keystream. Note that the status port becomes zero when the keystream generation is complete.
2.2 Hardware Platform of the 8051 Coprocessor
We will now capture the hardware platform of Fig. 11.5 as a GEZEL program. Listing 11.6 shows the complete platform apart from the Trivium kernel (which was discussed in Sect. 11.1.3). The first part of the Listing captures all the 8051-specific interfaces. The Trivium coprocessor will be connected on top of these interfaces.
- Line 1–6: The 8051 core my8051 will read in an executable called trivium.ihx. The executable is in Intel Hex Format, a common format for microcontroller binaries. The period of the core is 1, meaning that the clock frequency of the 8051 core is the same as the hardware clock frequency. A traditional 8051 architecture uses 12 clock cycles per instruction. Thus, a period of 1 means that there will be a single instruction executing each 12 clock cycles.
- Line 7–21: Three I/O ports of the 8051 are defined as P0, P1, and P2. A port is configured either as input or as output by choosing its type to be i8051systemsource (e.g., Line 8,13) or else i8051systemsink (e.g., Line 18).
- Line 22–30: A dual-port, shared-memory RAM module attached to the XBUS is modeled using an ipblock. The module allows the starting address (xbus, Line 28) and the number of memory locations (xrange, Line 29) to be specified.
The triviumitf module integrates the Trivium hardware kernel (Line 44) on top of the hardware/software interfaces. Several registers are used to manage this module, including a Trivium state register tstate, a round counter cnt, and a ram address counter ramcnt (Line 50–53).
The key and initialization vector are programmed into the state register through a sequence of chained multiplexers (Line 56–82). This works as follows. First consider the update of tstate on Line 82. If the counter value cnt is nonzero, tstate will copy the value so, which is the output of the Trivium kernel. If the counter value cnt is zero, tstate will instead copy the value of init, which is defined through Line 56–78. Thus, by loading a nonzero value into cnt (Line 80–81), the Trivium kernel performs active encryption rounds.
Now, when the count value is zero, the state register can be reinitialized with a chosen key and initialization vector. Each particular command in the range 0x1 to 0x14 will replace a single byte of the key or the initialization vector (Line 56–76). The init command will pad 0b111 into the most significant bits of the state register (Line 78).
Finally, the RAM control logic is shown on Line 86–89. Whenever the count value is nonzero, the ram address starts incrementing and the ram interface carries a write command.
2.3 Software Driver for 8051
The software driver for the above coprocessor is shown in Listing 11.7. This C code is written for the 8051 and can be compiled with SDCC, the Small Devices C Compiler (http://sdcc.sourceforge.net). This compiler allows directly using symbolic names, such as the names of the I/O ports P0, P1, and P2.
The program demonstrates the loading of a key and initialization vector (Line 21–43), the execution of the key schedule (Line 46–50), and the generation of a keystream of 250 bytes (Line 53–56). Note that the software driver does not strictly follow the interleaving of active commands with ins_idle. However, this code will work fine for the hardware model from Listing 11.6.
As discussed before, the key scheduling of Trivium is similar to the normal operation of Trivium. Key scheduling involves running Trivium for a fixed number of rounds while discarding the keystream. Hence, the key scheduling part of the driver software is, apart from the number of rounds, identical to the encryption part.
Finally, Line 64 illustrates how to terminate the simulation. By writing the value 0x55 into port P3, the simulation will halt. This is an artificial construct. Indeed, the software on a real microcontroller will run indefinitely.
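The command flow of the driver can be sketched as a host-side model that only builds the sequence of port values, rather than writing to real 8051 ports. This is not Listing 11.7. The opcode values are inferred from the simulation trace and from the command range 0x1 to 0x14 mentioned earlier, so they should be treated as assumptions, with Table 11.1 as the authoritative definition. The names `cmd_t`, `emit`, and `build_sequence` are hypothetical. Unlike the actual driver, this sketch strictly interleaves every active command with ins_idle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values are assumptions inferred from the simulation trace. */
enum {
    INS_IDLE = 0x00,
    INS_KEY0 = 0x01,   /* 0x01..0x0A: key bytes 0..9 */
    INS_IV0  = 0x0B,   /* 0x0B..0x14: IV bytes 0..9  */
    INS_INIT = 0x15,
    INS_ENC  = 0x16
};

typedef struct { uint8_t data; uint8_t ins; } cmd_t;  /* operand, opcode */

/* Append one command followed by ins_idle, so the hardware can detect
 * the cycle in which each new command becomes valid. */
static size_t emit(cmd_t *seq, size_t n, uint8_t data, uint8_t ins)
{
    seq[n++] = (cmd_t){ data, ins };
    seq[n++] = (cmd_t){ data, INS_IDLE };
    return n;
}

/* Build the port-write sequence: load key and IV, initialize, run the
 * key schedule for `sched` rounds, then request `nbytes` of keystream.
 * Returns the number of entries written (46 for this sequence). */
size_t build_sequence(cmd_t *seq, const uint8_t key[10],
                      const uint8_t iv[10], uint8_t sched, uint8_t nbytes)
{
    assert(seq != NULL && key != NULL && iv != NULL);
    size_t n = 0;
    for (int i = 0; i < 10; i++)
        n = emit(seq, n, key[i], (uint8_t)(INS_KEY0 + i));
    for (int i = 0; i < 10; i++)
        n = emit(seq, n, iv[i], (uint8_t)(INS_IV0 + i));
    n = emit(seq, n, 0, INS_INIT);
    n = emit(seq, n, sched, INS_ENC);    /* key schedule, output discarded */
    n = emit(seq, n, nbytes, INS_ENC);   /* keystream into the shared RAM  */
    return n;
}
```

On the real platform, each entry would become a pair of writes to the operand and instruction ports, and the driver would poll the status port until it returns to zero after each ins_enc command before issuing the next one.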
We can now compile the software driver and execute the simulation. The following commands illustrate the output generated by the program. Note that the 8051 microcontroller does not support standard I/O in the traditional sense: it is not possible to use printf statements without additional I/O hardware and appropriate software libraries. The instruction-set simulator deals with this limitation by printing the value of all ports each time a new value is written into them. Hence, the four columns below correspond to the value of P0, P1, P2, and P3, respectively. We annotated the tool output to clarify the meaning of the sequence of values.
> sdcc trivium.c
> gplatform tstream.fdl
i8051system: loading executable [trivium.ihx]
0xFF 0x01 0x00 0xFF
0x80 0x01 0x00 0xFF    [Program Key]
0x80 0x02 0x00 0xFF
0x00 0x02 0x00 0xFF
0x00 0x03 0x00 0xFF
0x00 0x04 0x00 0xFF
0x00 0x05 0x00 0xFF
0x00 0x06 0x00 0xFF
0x00 0x07 0x00 0xFF
0x00 0x08 0x00 0xFF
0x00 0x09 0x00 0xFF
0x00 0x0A 0x00 0xFF
0x00 0x0B 0x00 0xFF    [Program IV]
0x00 0x0C 0x00 0xFF
0x00 0x0D 0x00 0xFF
0x00 0x0E 0x00 0xFF
0x00 0x0F 0x00 0xFF
0x00 0x10 0x00 0xFF
0x00 0x11 0x00 0xFF
0x00 0x12 0x00 0xFF
0x00 0x13 0x00 0xFF
0x00 0x14 0x00 0xFF
0x00 0x15 0x00 0xFF
0x8F 0x15 0x00 0xFF
0x8F 0x16 0x00 0xFF
0x8F 0x00 0x7A 0xFF    [Run key schedule]
0xFA 0x00 0x00 0xFF
0xFA 0x16 0x00 0xFF
0xFA 0x00 0xE5 0xFF    [Produce 250 bytes]
0x00 0x00 0x00 0xFF
0x00 0xCB 0x00 0xFF    [First output byte]
0x01 0xCB 0x00 0xFF
0x01 0xCC 0x00 0xFF    [Second output byte]
0x02 0xCC 0x00 0xFF
0x02 0xCE 0x00 0xFF    [Third output byte]
0x03 0xCE 0x00 0xFF
0x03 0x75 0x00 0xFF
0x04 0x75 0x00 0xFF
0x04 0x7B 0x00 0xFF
0x05 0x7B 0x00 0xFF
0x05 0x99 0x00 0xFF
0x06 0x99 0x00 0xFF
0x06 0xBD 0x00 0xFF
0x07 0xBD 0x00 0xFF
0x07 0x79 0x00 0xFF
0x07 0x79 0x00 0x55    [Terminate]
Total Cycles: 13332
The last line of output shows 13,332 cycles, which is a long time when we realize that a single keystream byte can be produced by the hardware within a single clock cycle. How hard is it to determine intermediate time-stamps on the execution of this program? While some instruction-set simulators provide direct support for this, we will need to develop a small amount of support code to answer this question. We will introduce an additional coprocessor command which, when observed by the triviumitf module, will display the current cycle count. This is a debug-only command, similar to the terminate call in the 8051 software.
The modifications for such a command to the code are minimal. In the C code, we add a function to call when we would like to see the current cycle count.
void showcycle() {
  P1 = 0x20;   // debug command: ask triviumitf to display the cycle count
  P1 = 0x0;    // return the instruction port to idle
}
In the GEZEL code, we extend the triviumitf with a small FSM to execute the new command.
dp triviumitf {
  reg rupins : ns(8);
  ...
  always {
    ...
    rupins = upins;
  }
  sfg show { $display("Cycle: ", $cycle); }
  sfg idle { }
}
fsm f_triviumitf(triviumitf) {
  initial s0;
  state s1;
  @s0 if (rupins == 0x20) then (show) -> s1;
      else (idle) -> s0;
  @s1 if (rupins == 0x00) then (idle) -> s0;
      else (idle) -> s1;
}
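The two-state FSM above ensures that the cycle count is displayed once per command, even though the value 0x20 stays on the port for many hardware clock cycles. The following C model, a sketch with the hypothetical name `count_show_events`, steps through sampled port values and counts how often the show action would fire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { S0, S1 } fsm_state_t;

/* Feed one sampled value of the instruction port per clock cycle and
 * count how often the `show` action would execute. */
unsigned count_show_events(const uint8_t *rupins, size_t cycles)
{
    assert(rupins != NULL || cycles == 0);
    fsm_state_t state = S0;
    unsigned shows = 0;
    for (size_t i = 0; i < cycles; i++) {
        if (state == S0) {
            if (rupins[i] == 0x20) { shows++; state = S1; }
        } else {                     /* S1: wait until the port clears */
            if (rupins[i] == 0x00) state = S0;
        }
    }
    return shows;
}
```

For example, a port that holds 0x20 for three cycles, returns to 0x00, and later holds 0x20 again produces two show events, one per call of showcycle().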
Each time showcycle() executes, the current cycle count will be printed by GEZEL. This particular way of measuring performance has a small overhead (88 cycles per call to showcycle()). We add the command in the C code at the following places.
- In the main function, just before programming the first key byte.
- In the main function, just before starting the key schedule.
- In the main function, just before starting the key stream.
Figure 11.7 illustrates the resulting cycle counts obtained from the simulation. The output shows that most time is spent in startup (initialization of the microcontroller), and that the software-hardware interaction, as expected, is expensive in cycle-cost. For example, programming a new key and re-running the key schedule costs 1416 cycles, almost ten times as long as what is really needed by the hardware (143 cycles). This stresses once more the importance of carefully considering hardware–software interactions during the design.
3 Trivium for 32-bit Platforms
Our second Trivium coprocessor integrates the algorithm on a 32-bit StrongARM processor. We will compare two integration strategies: a memory-mapped interface and a custom-instruction interface. Both scenarios are supported through library modules in the GEZEL kernel. The hardware kernel follows the same ideas as before. By unrolling a trivium kernel 32 times, we obtain a module that produces 32 bits of keystream material per clock cycle. After loading the key and initialization vector, the key schedule of such a module has to execute for \(4 \times 288/32 = 36\) clock cycles before the first word of the keystream is available.
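The relation between the unrolling factor and the length of the key schedule can be written as a small helper. This is a sketch with the hypothetical name `init_cycles`; it assumes that the unrolling factor divides \(4 \times 288\) evenly and that the hardware needs no extra cycles for loading or counter handling.

```c
#include <assert.h>

enum { TRIVIUM_STATE_BITS = 288, TRIVIUM_INIT_PASSES = 4 };

/* Clock cycles spent in the key schedule when `unroll` state updates
 * are performed per cycle. Assumes `unroll` divides 4 * 288 evenly. */
unsigned init_cycles(unsigned unroll)
{
    assert(unroll > 0 &&
           (TRIVIUM_INIT_PASSES * TRIVIUM_STATE_BITS) % unroll == 0);
    return (TRIVIUM_INIT_PASSES * TRIVIUM_STATE_BITS) / unroll;
}
```

With this helper, a bit-serial kernel needs 1152 cycles, the 32-bit kernel used in this section needs 36 cycles, and a 64-bit kernel would need 18 cycles.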
3.1 Hardware Platform Using Memory-mapped Interfaces
Figure 11.8 shows the control shell design for a Trivium kernel integrated to a 32-bit memory-mapped interface. There are four memory-mapped registers involved: din, dout, control, and status. In this case, the key stream is directly read by the processor. The Trivium kernel follows the design we discussed earlier in Sect. 11.1.3. There is one additional control input, go, which is used to control the update of the state register. Instead of having a free-running Trivium kernel, the update of the state register will be strictly controlled by software, so that the entire keystream is captured using read operations from a memory-mapped interface.
As with other memory-mapped interfaces, our first task is to design a control shell to drive the Trivium kernel. We start with the command set. The command set must be able to load a key, an initialization vector, run the key schedule, and retrieve a single word from the key stream. Figure 11.9 illustrates the command set for this coprocessor.
The control memory-mapped register has a dual purpose. It transfers an instruction opcode as well as a parameter. The parameter indicates the part of the key or initial value which is being transferred. The parameter is 0, 1, or 2, because 3 words are sufficient to cover the 80 bits from the stream cipher key or the stream cipher initial value. The ins_idle instruction has the same purpose as before: it is used to synchronize the transfer of data operands with instructions. There are two commands to retrieve keystream bits from the coprocessor: ins_outword0 and ins_outword1. Both of these transfer a single word from the stream cipher dout, and they are used alternately in order to avoid sending dummy ins_idle to the coprocessor.
Listing 11.8 shows the design of the control shell module. The design of the Trivium kernel is not shown in this listing, although a very similar design can be found in Listing 11.4. The first part of Listing 11.8, Line 1–25, shows the memory-mapped interface to the ARM core. This includes instantiation of the core (Line 1–5), and four memory-mapped registers (Line 6–25). The bulk of the code, Line 25–73, contains the control shell for the Trivium kernel. The kernel is instantiated on Line 36. The registers for key and initial value, defined on Line 33, are programmed from software through a series of simple decode steps (Line 46–56). The encoding used by the control memory mapped register corresponds to Fig. 11.9.
The control pins of the Trivium kernel (ld, go) are programmed by means of simple decoding steps as well (Line 59, 67–71). Note that the go pin is driven by a pulse of a single clock cycle, rather than a level programmed from software. This is done by detecting the exact cycle when the value of the control memory mapped interface changes. Note that the overall design of this control shell is quite simple, and does not require complex control or a finite state machine. Finally, the system integration consists of interconnecting the control shell and the memory-mapped interfaces (Line 75–86).
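The change detection that turns a software-written level into a one-cycle go pulse can be modeled in C as follows. This is a sketch and not the GEZEL code of Listing 11.8. The command values 4 and 5 are those the driver alternates between, while the names `CMD_OUTWORD0`, `CMD_OUTWORD1`, and `go_pulses` are hypothetical, and the model assumes that only the command field is compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder names for the two alternating output commands. */
enum { CMD_OUTWORD0 = 4, CMD_OUTWORD1 = 5 };

/* For each clock cycle, derive a one-cycle `go` pulse from the sampled
 * command field: pulse only when the value changed since the previous
 * cycle and the new value is one of the two output commands.
 * Returns the number of pulses, i.e. the number of state updates. */
unsigned go_pulses(const uint8_t *cmd, size_t cycles, uint8_t *go)
{
    assert(cmd != NULL && go != NULL);
    uint8_t prev = 0;            /* register holding last cycle's command */
    unsigned pulses = 0;
    for (size_t i = 0; i < cycles; i++) {
        int is_out = (cmd[i] == CMD_OUTWORD0 || cmd[i] == CMD_OUTWORD1);
        go[i] = (uint8_t)(is_out && cmd[i] != prev);
        pulses += go[i];
        prev = cmd[i];
    }
    return pulses;
}
```

Because the command stays constant for many hardware cycles between two software writes, a level-sensitive go would advance the state more than once per word. Alternating between two distinct command values gives the hardware a change to detect on every write, so no intermediate idle command is needed.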
3.2 Software Driver Using Memory-mapped Interfaces
A software driver for the memory-mapped Trivium coprocessor is shown in Listing 11.9. The driver programs the initial value and key, runs the key schedule, and next receives 512 words of keystream. The state update of the Trivium coprocessor is controlled by alternately writing 4 and 5 to the command field of the control memory-mapped interface. This is done during the key schedule (Line 28–32) as well as during the keystream generation (Line 34–39).
The code also contains an external system call getcyclecount(). This is a simulator-specific call, in this case specific to SimIt-ARM, to return the current cycle count of the simulation. By inserting such calls in the driver code (in this case, on Line 27, 33, 40), we can obtain the execution time of selected phases of the keystream generation.
To execute the system simulation, we compile the software driver, and run the GEZEL hardware module and the software executable in gplatform. The simulation output shows the expected keystream bytes: 0xcc, 0xce, 0x75, ... The output also shows that the key schedule completed in 435 cycles, and that 512 words of keystream were generated in 10,524 cycles.
>arm-linux-gcc -static trivium.c cycle.s -o trivium
>gplatform trivium32.fdl
core myarm
armsystem: loading executable [trivium]
armsystemsink: set address 2147483648
armsystemsink: set address 2147483656
ccce757b ccce757b 99bd7920 9a235a88 1251fc9f aff0a655 7ec8ee4e bfd42128 86dae608 806ea7eb 58aec102 16fa88f4 c5c3aa3e b1bcc9f2 bb440b3f c4349c9f
key schedule cycles: 435 stream cycles: 10524
Total Cycles: 269540
We now analyze the performance results for this design. As the Trivium kernel used in this design is unrolled 32 times (and thus can produce a new word every clock cycle), 512 words in 10,524 clock cycles is not a stellar result. Each word requires around 20 clock cycles. This includes synchronization of software and hardware, transfer of a result, writing that result into memory, and managing the loop counter and address generation (lines 34–39 in Listing 11.9). However, another way to phrase the performance question is: how much better is this result compared to an optimized full-software implementation? To answer this question, we can port an available, optimized implementation to the StrongARM and perform a similar profiling. We used the implementation developed by one of Trivium’s authors, C. De Canniere, in this profiling experiment, and found that this implementation takes 3,810 cycles for key schedule and 48,815 cycles for generating 512 words. Thus, each word of the keystream requires close to 100 clock cycles on the ARM. Therefore, we conclude that the hardware coprocessor is still about five times faster than an optimized software implementation, although the hardware coprocessor has an overhead factor of 20 compared to a standalone hardware implementation.
As we wrote the hardware from scratch, one may wonder if it would not have been easier to try to port the Trivium software implementation into hardware. In practice, this may be hard to do, because the optimizations one does for software are very different from the optimizations one does for hardware. As an example, Listing 11.10 shows part of the software-optimized Trivium implementation of De Canniere. This implementation was written with 64-bit execution in mind. Clearly, the efficient translation of this code into hardware is quite difficult, since the specification does not have the same clarity as the algorithm definition we discussed at the start of the chapter.
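The per-word figures quoted in this analysis follow directly from the measured cycle counts. The helper below, a sketch with the hypothetical name `cycles_per_word`, makes the arithmetic explicit.

```c
#include <assert.h>

/* Average number of clock cycles per keystream word. */
double cycles_per_word(unsigned long cycles, unsigned long words)
{
    assert(words > 0);
    return (double)cycles / (double)words;
}
```

For the memory-mapped coprocessor, 10,524 cycles over 512 words gives roughly 20.6 cycles per word. For the software implementation, 48,815 cycles over 512 words gives roughly 95.3 cycles per word. The ratio of the two is about 4.6, which the text rounds to a factor of five.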
This completes our discussion of the memory-mapped Trivium coprocessor design. In the next section, we consider a third type of hardware/software interface for the Trivium kernel: the mapping of Trivium into custom instructions on a 32-bit processor.
3.3 Hardware Platform Using a Custom-Instruction Interface
The integration of a Trivium coprocessor as a custom datapath in a processor requires a processor that supports custom-instruction extensions. As discussed in Chap. 9, this has a strong impact on the tools that come with the processor. In this example, we will make use of the custom-instruction interface of the StrongARM processor discussed in Sect. 9.5.1. Figure 11.10 shows the design of a Trivium Kernel integrated into two custom-instruction interfaces, an OP3X1 and an OP2X2. The former is an instruction that takes three 32-bit operands and produces a single 32-bit result. The latter is an instruction that takes two 32-bit operands and produces two 32-bit results.
During normal operation, the Trivium state is fed through two Trivium kernels which each provide 32 bits of keystream. These two words form the results of an OP2X2 instruction. The same OP2X2 instruction also controls the update of the Trivium state. This way, each custom OP2X2 instruction advances the Trivium algorithm by one step, producing 64 bits of keystream. When the Trivium algorithm is not advancing, the state register can be reprogrammed by means of OP3X1 instructions. The third operand of OP3X1 selects which part of the 288-bit state register will be modified. The first and second operands contain 64 bits of state register data. The result of the OP3X1 instruction is ignored.
Thus, both programming and keystream retrieval can be done using a bandwidth of 64 bits, which is larger than the memory-mapped interface. Hence, we can expect a speedup over the previous implementation. Listing 11.10 shows a GEZEL listing for this design. As before, we have left out the Trivium kernel which is similar to the one used in Listing 11.4.
The interface with the ARM is captured on line 1–16, and this is followed by the Trivium control shell on line 18–57. The Trivium state register is represented as nine registers of 32 bits rather than a single 288-bit register. Two 32-bit Trivium kernels are instantiated on line 33 and 34. The state register update is controlled by the adv control flag, as well as the value of the third operand of the OP3X1 instruction (line 37–45). The output of the Trivium kernels is fed into the result of the OP2X2 instruction (line 51–52). Finally, the adv flag is created by detecting an edge in the OP2X2 operand (line 54–55). In practice, this means that two calls to OP2X2 are needed to advance Trivium one step.
3.4 Software Driver for a Custom-Instruction Interface
Listing 11.11 shows a software driver for the Trivium custom-instruction processor that generates a keystream of 512 words in memory. The driver starts by loading key and data (line 25–30), running the key schedule (line 34–37), and retrieving the keystream (line 41–48). At the same time, the getcyclecount system call is used to determine the performance of the key schedule and the keystream generation part.
The algorithm can be compiled with the ARM cross-compiler and simulated on top of GEZEL gplatform. This results in the following output.
>arm-linux-gcc -static trivium.c cycle.s -o trivium
>gplatform triviumsfu.fdl
core myarm
armsystem: loading executable [trivium]
ccce757b 99bd7920 9a235a88 1251fc9f aff0a655 7ec8ee4e bfd42128 86dae608 806ea7eb 58aec102 16fa88f4 c5c3aa3e b1bcc9f2 bb440b3f c4349c9f be0a7e3c
key schedule cycles: 289 stream cycles: 8862
Total Cycles: 42688
We can verify that, as before, the correct keystream is generated. The cycle count of the algorithm is significantly smaller than before: the key schedule completes in 289 cycles, and the keystream is generated within 8,862 cycles. This implies that each word of keystream requires around 17 cycles. If we turn on the -O3 flag while compiling the driver code, we obtain 67 and 1,425 clock cycles for key schedule and keystream, respectively, implying that each word of the keystream requires less than three cycles! Hence, we conclude that for this design, an ASIP interface is significantly more efficient than a memory-mapped interface.
4 Summary
In this chapter, we designed a stream cipher coprocessor for three different hosts: a small 8-bit microcontroller, a 32-bit SoC processor, and a 32-bit ASIP. In each of these cases, we created a control shell to match the coprocessor to the available hardware–software interface. The stream cipher algorithm was easy to scale over different word-lengths by simply unrolling the algorithm. The performance evaluation results of all these implementations are captured in Table 11.2. These results demonstrate two points. First, it is not easy to achieve the performance of raw hardware. All of the coprocessors are limited by their hardware–software interface or by the speed of software on the host, not by the computational limits of the hardware coprocessors. Second, the wide variation of performance results underlines the importance of a carefully designed control shell, and of a careful consideration of the application when selecting a hardware–software interface.
5 Further Reading
The standard reference on cryptographic algorithms is the handbook by Menezes, van Oorschot, and Vanstone (Menezes et al. 2001). Of course, cryptography is a fast-moving field. The algorithm described in this chapter was developed for the eSTREAM project (ECRYPT 2008) in 2005. The Trivium specifications are by De Canniere and Preneel (De Canniere and Preneel 2005). The Trivium webpage on the eSTREAM website describes several other hardware implementations of Trivium.
6 Problems
11.1. Design a control shell for the Trivium algorithm on top of a Fast Simplex Link interface. Please refer to Sect. 9.4.2 for a description of the FSL timing and the FSL protocol. Assume the following interface for your module.
dp trivium_fsl(in  idata  : ns(32); // input slave interface
               in  exists : ns(1);
               out read   : ns(1);
               out odata  : ns(32); // output master interface
               in  full   : ns(1);
               out write  : ns(1))
11.2. Consider a simple linear feedback shift register, defined by the following polynomial: \(g(x) = {x}^{35} + {x}^{2} + 1\). A possible hardware implementation of this LFSR is shown in Fig. 11.11. This polynomial is primitive, which implies that the LFSR will generate a so-called m-sequence: for any nonzero initialization of the registers, the structure will cycle through all possible \({2}^{35} - 1\) nonzero states before returning to the same state.
(a) Write an optimized software implementation of an LFSR generator that calculates the first 1,024 states starting from the initialization \({x}^{32} = {x}^{33} = {x}^{34} = {x}^{35} = 1\) and all other bits 0. For each state you need to store only the first 32 bits.
(b) Write an optimized standalone hardware implementation of an LFSR generator that calculates the first 1,024 states starting from the initialization \({x}^{32} = {x}^{33} = {x}^{34} = {x}^{35} = 1\) and all other bits 0. You do not need to store the first 32 bits, but can feed them directly to an output port.
(c) Design a control shell for the module you have designed under (b), and use a memory-mapped interface to capture and store the first 1,024 states of the LFSR. You only need to capture the first 32 bits of each state. Compare the resulting performance to the solution of (a).
(d) Design a control shell for the module you have designed under (b), and use a custom-instruction interface to capture and store the first 1,024 states of the LFSR. You only need to capture the first 32 bits of each state. Compare the resulting performance to the solution of (a).
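The m-sequence property mentioned in Problem 11.2 is easy to check on a much smaller register. The sketch below is an illustration only, not a solution to the problem: it uses the 5-bit primitive polynomial \({x}^{5} + {x}^{2} + 1\) instead of the 35-bit one, and it assumes a particular tap and shift convention that may differ from the structure in Fig. 11.11. The function names lfsr5_step and lfsr5_period are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 5-bit Fibonacci LFSR with taps at stages 5 and 2.
 * Assumption: stage k is stored in bit k-1 and the register shifts left. */
static uint8_t lfsr5_step(uint8_t s)
{
    uint8_t fb = (uint8_t)(((s >> 4) ^ (s >> 1)) & 1u);
    return (uint8_t)(((s << 1) | fb) & 0x1Fu);
}

/* Count the steps needed until the start state appears again. */
static unsigned lfsr5_period(uint8_t start)
{
    assert(start != 0 && start < 32);
    uint8_t s = start;
    unsigned n = 0;
    do {
        s = lfsr5_step(s);
        n++;
    } while (s != start);
    return n;
}
```

For every nonzero start value, lfsr5_period returns 31, which equals \({2}^{5} - 1\). The all-zero state is excluded because it maps onto itself. A 35-bit version would need a 64-bit state variable and a mask of 35 bits.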
References
ECRYPT (2008) The estream project. Tech. rep., http://www.ecrypt.eu.org/stream/technical.html
Menezes A, van Oorschot P, Vanstone S (2001) Handbook of Applied Cryptography. CRC Press
© 2010 Springer Science+Business Media, LLC
Schaumont, P.R. (2010). Trivium Crypto-Coprocessor. In: A Practical Introduction to Hardware/Software Codesign. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6000-9_11
DOI: https://doi.org/10.1007/978-1-4419-6000-9_11
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5999-7
Online ISBN: 978-1-4419-6000-9