ECC on Your Fingertips: A Single Instruction Approach for Lightweight ECC Design in GF(p)

Roy, Debapriya Basu; Das, Poulami; Mukhopadhyay, Debdeep

doi:10.1007/978-3-319-31301-6_9


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 9566)

Abstract

Lightweight implementation of Elliptic Curve Cryptography on FPGA has been a popular research topic due to the boom of ubiquitous computing. In this paper we propose a novel single instruction based ultra-light ECC crypto-processor coupled with dedicated hard-IPs of the FPGAs. We show that by using the proposed single instruction framework and the available block RAMs and DSPs of FPGAs, we can design an ECC crypto-processor for NIST curve P-256 requiring only 81 and 72 logic slices on Virtex-5 and Spartan-6 devices respectively. To the best of our knowledge, this is the first implementation of ECC which requires fewer than 100 slices on any FPGA device family.

D. Mukhopadhyay—This work was partially supported by project from Defence Research and Development Organization (DRDO), India [Sanction No: ERIP/ER/1100420/M/01/1517].


1 Introduction

With the recent boom in ubiquitous computing, especially in the Internet of Things (IoT), the need for lightweight cryptography, at either the algorithmic or the implementation level, has increased significantly. Although researchers have proposed various lightweight symmetric ciphers, the most popular options for public key cryptography remain RSA and Elliptic Curve Cryptography (ECC). ECC-based cryptosystems are preferred over RSA because they offer a higher security level per key bit. Every ECC-based protocol or algorithm relies on elliptic curve scalar multiplication, which in turn requires a large number of field operations and is therefore computationally intensive. Software implementations of ECC running on smart cards or AVR microcontrollers are slow and can become a performance bottleneck for many applications. As an alternative, dedicated ECC crypto-processors are built on hardware platforms such as ASICs (Application Specific Integrated Circuits) and FPGAs (Field Programmable Gate Arrays).

Although ASIC implementations are faster than FPGA implementations, FPGAs are sometimes preferred for cryptographic applications because of their reconfigurability, short time to market and in-house security. The entire design cycle of an FPGA-based system can be completed inside a single lab, unlike ASIC-based systems, where several different parties are involved in the design cycle. Moreover, modern FPGAs come in various device families that offer interesting design choices to the designer. These FPGAs are also equipped with dedicated hard IPs such as DSP blocks and block RAMs which, when properly utilized, result in efficient ECC crypto-processors in GF(p) with improved timing performance and reduced area overhead.

Many works in the literature focus on efficient implementation of ECC crypto-processors in GF(p) on FPGAs. An overview of such implementations can be found in [1]. A lightweight ASIC design was reported in [2]. High speed designs for FPGAs can be found in [3], which are significantly faster than the earlier designs reported in [4, 5]. However, although the design in [3] requires much less area than those earlier designs (1715 logic slices on a Virtex-4 platform for NIST P-256), it is still too large for lightweight applications. A fast pipelined modular multiplier for ECC field multiplication was proposed in [6], whereas an optimized tiling methodology targeting the rectangular DSP blocks of Virtex-5 FPGAs was proposed in [7]. However, both have considerable area overhead and hence cannot be used in lightweight applications.

A lightweight ECC algorithm for RFID tags was presented in [8], and authentication and ID transfer protocols based on lightweight ECC were introduced in [9]. At the implementation level, a lightweight architecture known as Micro-ECC was proposed in [10]. That design shows a significant improvement in area-time product over the earlier implementations [11–13]. However, Micro-ECC was implemented on the Virtex-II platform, which Xilinx no longer recommends for new designs [14]. Moreover, unlike [11–13], the Micro-ECC architecture does not support generalized ECC scalar multiplication over an arbitrary prime field. Nevertheless, for the fixed P-256 and P-224 curves, Micro-ECC outperforms the others by a large margin. A lightweight implementation of IPsec protocols, comprising the lightweight block cipher PRESENT, the lightweight hash function PHOTON and an ECC crypto-processor (P-160 and P-256), was presented in [15]. Its ECC implementation requires 670 logic slices on a Spartan-6 platform for the NIST P-256 curve. In addition, a lightweight architecture supporting both RSA and ECC, along with some side channel countermeasures, was proposed in [16]. That design consumes 1914 logic slices on a Virtex-5 platform, which is quite low considering that it supports both RSA and ECC. As an alternative to the standard NIST curves, many researchers have recommended Edwards curves and hyperelliptic curves (HECC). Efficient lightweight implementations of scalar multiplication on such curves can be found in [17, 18].

In this paper, we propose an alternative single instruction approach for designing a lightweight ECC scalar multiplier, which has not been adopted in previous works. It is well known that a Turing complete processor can be constructed from a single instruction such as SBN (subtract and branch if negative) or SUBLEQ (subtract and branch if the result is less than or equal to zero). However, although a single instruction processor can execute any arithmetic or logical operation, the execution time of some operations becomes so large that it cannot be used in practical scenarios. Hence, a stand-alone URISC processor cannot be used for computationally intensive ECC applications. We show that, by using the dedicated hard-IPs of the FPGA together with some simple modifications of a URISC processor, it is nevertheless possible to design an extremely lightweight yet practical ECC architecture.

To the best of our knowledge, this is the first implementation of ECC scalar multiplication that requires fewer than 100 slices on the Virtex-5 and Spartan-6 platforms. This reduction in slice consumption is achieved by combining the lightweight architecture of a single instruction processor with intensive use of the hard-IPs of modern FPGAs. ECC scalar multiplication must compute and store multiple temporary variables in addition to its inputs and outputs. This leads to heavy register usage and hence increases the slice consumption. We show an alternative design approach in which block RAMs are used intensively to reduce the slice consumption significantly. A further reduction is obtained by replacing LUT logic with high speed DSP blocks wherever possible. The strategy of using block RAMs to reduce slice consumption has already been applied to lightweight block ciphers such as PRESENT [19], where the authors showed that a block RAM based block cipher design can be extremely lightweight, leaving more slices for other applications.

The contributions of the present paper are as follows:

We propose a single instruction ECC crypto-processor for the NIST P-256 curve, and analyze the challenges, along with their solutions, that a designer faces when applying the single instruction approach to lightweight ECC designs.
We show that a single instruction based ECC crypto-processor, coupled with intensive use of block RAMs and DSP blocks, yields an extremely lightweight design for ECC scalar multiplication. The proposed processor requires fewer than 100 slices on both the Virtex-5 and Spartan-6 families and makes thorough use of FPGA hard-IPs.

The rest of the paper is structured as follows. Sect. 2 gives a brief introduction to ECC and single instruction processors. Section 3 describes the single instruction processor in detail, along with the modifications required for efficient ECC scalar multiplication. Sect. 4 presents the lightweight field multiplier, and Sect. 5 describes the architecture of the complete ECC crypto-processor. In Sect. 6, we discuss the timing and area performance of our design, followed by the conclusion in Sect. 7.

2 Preliminaries

In this section, we will give a brief summary of ECC and single instruction processors.

2.1 Elliptic Curve Cryptography

As mentioned previously, elliptic curve cryptography (ECC) is a form of public key cryptography based on elliptic curves over finite fields. The security of ECC depends on the intractability of the discrete logarithm of a point on an elliptic curve with respect to a known base point.

An elliptic curve over the finite field GF(p) is defined by the following equation

$$y^2 = x^3 + ax + b; \quad a, b \in GF(p),\ b \ne 0. \tag{1}$$

Scalar multiplication is the most important operation in ECC-based key agreement and digital signature schemes. Given a point P on an elliptic curve and a scalar k, scalar multiplication computes kP, the sum of k copies of P. The basic algorithm for scalar multiplication is the double-and-add algorithm, given as Algorithm 1 in Appendix A, which executes scalar multiplication as a repeated sequence of point doublings and point additions. It is advantageous to use standard projective coordinates [20] for ECC scalar multiplication, as they require fewer field inversions than the affine coordinate system. In this paper, we use standard projective coordinates in our implementation of ECC scalar multiplication.
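The double-and-add loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses affine coordinates and a toy curve $y^2 = x^3 + 2x + 2$ over GF(17) with base point (5, 1), not the projective coordinates and NIST P-256 parameters used in the paper, and the names `point_add` and `scalar_mult` are hypothetical.

```python
# Toy parameters (assumption): y^2 = x^3 + 2x + 2 over GF(17), not NIST P-256.
P_MOD = 17
A = 2
G = (5, 1)          # a point on the toy curve
INFINITY = None     # point at infinity (group identity)


def point_add(p1, p2):
    """Add two affine points; also handles doubling and the identity."""
    if p1 is INFINITY:
        return p2
    if p2 is INFINITY:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY                          # P + (-P)
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, point):
    """Left-to-right double-and-add: one doubling per key bit,
    one addition for every key bit equal to 1."""
    result = INFINITY
    for bit in bin(k)[2:]:                       # scan key bits from the MSB
        result = point_add(result, result)       # point doubling
        if bit == "1":
            result = point_add(result, point)    # point addition
    return result
```

The modular inverse via `pow(x, -1, m)` needs Python 3.8 or later. Each call to `pow` here stands for a field inversion, which is the cost that projective coordinates avoid.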

Each point addition and point doubling involves multiple field multiplications, which makes field multiplication the most critical operation for efficient scalar multiplication. NIST-specified curves are well suited to hardware implementation because modular reduction on those curves is simple, involving only a few additions and subtractions. The fast modular reduction algorithm for NIST P-256 is shown in Appendix A.

In our proposed design, we concentrate on the NIST P-256 curve. Nevertheless, our approach can be extended to other NIST curves as well.

2.2 Single Instruction Processor

The concept of a single instruction computer, or one instruction set computer (OISC), was first proposed in [21]. It was shown in [22] that a Turing complete machine can be built using just a single instruction. The idea of applying URISC to cryptographic applications was proposed in [23]. In a similar direction, the application of a one instruction set computer to computation on encrypted data was analyzed in [24]; however, the authors of that paper investigated OISC in the context of homomorphic encryption and did not consider elliptic curves, which are the focus of the present paper.

A standard single instruction processor can be built around an instruction such as:

1. ADDLEQ (Add the operands and branch if the answer is less than or equal to zero)
2. SUBLEQ (Subtract the operands and branch if the answer is less than or equal to zero)
3. SBN (Subtract the operands and branch if the answer is less than zero)
4. RSSB (Reverse subtract and skip if borrow)
5. SBNZ (Subtract the operands and branch if the answer is non-zero)

The main advantage of OISC is that we don’t need any instruction decoding mechanism, which makes the processor architecture exceptionally simple and lightweight. The instruction format of a standard OISC is shown in Fig. 1.

For the present work, we have chosen SBN as the single instruction. However, the proposed design strategy can be tweaked to adopt any of the above described instructions. The operation of SBN instruction is described in Table 1 (code 1.1):

Table 1. SBN and addition using SBN


Using this instruction, we can execute any mathematical, logical, flow-control, memory control or load-store type of operation. For example, Table 1 (code 1.2) shows how to add two operands using the SBN instruction.
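As an illustration of the SBN semantics and of addition built from it, the following Python sketch models a tiny SBN machine. It is not the listing from Table 1: the function name `run_sbn`, the memory layout and the scratch cell `Z` are assumptions made for this example.

```python
def run_sbn(program, mem, max_steps=10_000):
    """Execute SBN instructions (a, b, c):
    mem[a] = mem[a] - mem[b]; branch to c if the result is negative,
    otherwise fall through to the next instruction."""
    pc = 0
    steps = 0
    while 0 <= pc < len(program) and steps < max_steps:
        a, b, c = program[pc]
        mem[a] -= mem[b]
        pc = c if mem[a] < 0 else pc + 1
        steps += 1
    return mem


# Addition X = X + Y using a scratch cell Z (cell indices are an assumption).
X, Y, Z = 0, 1, 2
add_program = [
    (Z, Z, 1),   # Z = Z - Z = 0
    (Z, Y, 2),   # Z = 0 - Y = -Y
    (X, Z, 3),   # X = X - (-Y) = X + Y
]
memory = run_sbn(add_program, [20, 22, 99])
print(memory[X])  # 42
```

Every jump target equals the address of the next instruction, so the program runs straight through regardless of the sign of each intermediate result.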

In this section we have given a brief idea about elliptic curves and OISC. In the next section, we will go into more details of OISC based on SBN instruction and will analyze it from the point of view of elliptic curve applications.

3 SBN-OISC and Elliptic Curve Scalar Multiplication

In the previous section, we gave a brief overview of ECC and of OISC based on the SBN instruction (hereafter referred to as SBN-OISC). In this section, we focus on SBN-OISC in the context of ECC implementation. We identify the critical challenges that a designer faces when implementing ECC using SBN-OISC and provide solutions to those challenges. We first describe a stand-alone SBN-OISC processor in the next subsection.

3.1 Stand-Alone SBN-OISC Processor

A stand-alone SBN-OISC processor is shown in Fig. 2. The main components of an SBN processor are described below:

Instruction Memory: The instruction memory stores the instructions to be executed and can be implemented on an FPGA using block RAMs configured as a single port ROM. In Fig. 2, the instruction memory can store up to $2^{11}$ instructions, and each instruction is 21 bits wide. The instruction format is similar to Fig. 1, where the addresses of the two operands are 5 bits wide each and the jump address is 11 bits long.
Data Memory: The data memory stores the inputs, all temporary results required during the computation, and the final result. It is implemented using a block RAM configured as a true dual port RAM. The data memory has 32 entries, each 260 bits wide. When implementing scalar multiplication on NIST P-256, the partially reduced output can be up to 259 bits long, which requires a 260-bit signed representation. Hence we have chosen a 260-bit wide data path.
ALU: The arithmetic logic unit (ALU) of SBN-OISC contains a subtractor, which computes the difference between its two inputs. If the result is negative, the program counter is loaded with the jump address specified in the instruction. Otherwise, the program counter advances to the next instruction.
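The 21-bit instruction word described above (two 5-bit operand addresses and an 11-bit jump address) can be illustrated with a small packing sketch. The field widths come from the text; the field order inside the word and the helper names `encode` and `decode` are assumptions, since Fig. 1 is not reproduced here.

```python
A_BITS, B_BITS, JUMP_BITS = 5, 5, 11   # widths taken from the text


def encode(a, b, jump):
    """Pack one SBN instruction into a 21-bit word.
    Field order (a | b | jump, with a in the top bits) is an assumption."""
    assert 0 <= a < (1 << A_BITS) and 0 <= b < (1 << B_BITS)
    assert 0 <= jump < (1 << JUMP_BITS)
    return (a << (B_BITS + JUMP_BITS)) | (b << JUMP_BITS) | jump


def decode(word):
    """Unpack a 21-bit word into (a, b, jump)."""
    jump = word & ((1 << JUMP_BITS) - 1)
    b = (word >> JUMP_BITS) & ((1 << B_BITS) - 1)
    a = word >> (B_BITS + JUMP_BITS)
    return a, b, jump
```

With 5-bit addresses the instruction can reach all 32 data memory entries, and the 11-bit jump field covers the full $2^{11}$-entry instruction memory.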

The architecture described above is simple and extremely lightweight, requiring 66 logic slices on a Virtex-5 platform. However, ECC operations can be optimized further by introducing different variants of the SBN instruction. In the next subsection, we concentrate on these variants and discuss how they accelerate the ECC implementation.

3.2 Instruction Level Optimizations

Generally, though an OISC processor executes only a single instruction, it is possible to realize different versions of that single instruction to accelerate the desired operation. This approach helps us to reduce the size of instruction memory and consequently, results in faster execution of the aimed design. This is extremely helpful for computationally intensive ECC applications, as illustrated in the following discussion.

Switching Off Memory Write-Back. In the traditional SBN instruction (SBN A, B, C), the memory location A is always updated with the result $D.Mem[A]-D.Mem[B]$ (D.Mem is the data memory). However, we can reduce the instruction count considerably if we can switch off this memory write-back in some cases.

Let us consider a prime field addition. We assume that the two operands to be added are stored at memory locations A and B and that the modulus of the field is stored at memory location P. Table 2 (code 1.3) shows the realization of this operation using SBN. In this case, prime field addition requires 11 SBN instructions. If each SBN instruction takes n clock cycles, the field addition takes 11n clock cycles in total.

Now, let us consider the scenario shown in Table 2 (code 1.4), which uses two variants of the SBN instruction: $SBN_{nw}$ and $SBN_w$. The $SBN_w$ A, B, C instruction behaves like the normal SBN instruction: memory location A is updated as $D.Mem[A]=D.Mem[A]-D.Mem[B]$. In the case of $SBN_{nw}$ A, B, C, memory location A is not updated and keeps its previous value, so the instruction acts as a comparison that only affects the branch. If we use a combination of $SBN_w$ and $SBN_{nw}$ to implement prime field addition, we need only 7 instructions, as shown in Table 2 (code 1.4), a saving of 4 instructions. A similar saving can be obtained for field subtraction. In ECC scalar multiplication, every key bit requires a point doubling and every key bit equal to 1 additionally requires a point addition, so this saving translates into a significant speed-up. Each point doubling involves 11 field additions and each point addition requires 7 field additions. For a NIST P-256 key with 128 zero bits and 128 one bits, as expected on average for a random key, we save around $256\times 11\times 4 + 128 \times 7 \times 4=14848$ instructions.
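The effect of the write-back flag can be illustrated with a small Python model. This is a sketch under assumptions, not the listing from Table 2: the program below overwrites operand A with the sum and therefore uses a different instruction count from the 7-instruction version in the paper, and the names `run_sbn`, `field_add` and `saved_instructions` are hypothetical.

```python
def run_sbn(program, mem):
    """Each instruction is (a, b, c, write).
    diff = mem[a] - mem[b]; mem[a] is updated only when write is True.
    Branch to c if diff is negative, else fall through.
    Returns the number of instructions executed."""
    pc = executed = 0
    while pc < len(program):
        a, b, c, write = program[pc]
        diff = mem[a] - mem[b]
        if write:
            mem[a] = diff
        pc = c if diff < 0 else pc + 1
        executed += 1
    return executed


# Memory layout (assumption): A and B hold operands in [0, p), P holds p.
A, B, P, Z = 0, 1, 2, 3
field_add = [
    (Z, Z, 1, True),    # SBN_w : Z = 0
    (Z, B, 2, True),    # SBN_w : Z = -B
    (A, Z, 3, True),    # SBN_w : A = A + B
    (A, P, 5, False),   # SBN_nw: compare A with p, skip the correction if A < p
    (A, P, 5, True),    # SBN_w : A = A - p
]

# Instruction saving quoted in the text: 4 instructions per field addition,
# 11 additions per doubling (256 doublings), 7 per point addition (128 additions).
saved_instructions = 256 * 11 * 4 + 128 * 7 * 4
```

For example, with `mem = [10, 12, 17, 0]` the program leaves `mem[A] == 5`, that is (10 + 12) mod 17. The non-writing comparison is what lets the program test the sum against p without destroying it.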

Table 2. Field addition using different SBN instructions


Right Shift on SBN Processor. Right shift is an important operation for elliptic curve scalar multiplication, as it is required during field inversion. A right shift can be executed with the SBN instruction by repeated subtraction. For example, to right shift an operand by one bit position, we repeatedly subtract 2 from the operand until the result becomes less than 2; the number of subtractions performed is the shifted value. As we are concentrating on the NIST P-256 curve, the operands are typically 256 bits long, which makes this sequence of repeated subtractions extremely time consuming. On the other hand, a shifter on an FPGA has zero LUT overhead if the number of bit positions to be shifted is fixed. Hence, it is better to implement the right shift with a dedicated right shifter module instead of using SBN.
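To make the cost of the subtraction-only shift concrete, here is a minimal Python sketch of the idea. The function name `right_shift_by_subtraction` is hypothetical, and the loop models the behaviour rather than an actual SBN program.

```python
def right_shift_by_subtraction(x):
    """Compute x >> 1 for non-negative x using only subtraction and a
    comparison. The loop runs once per unit of the result."""
    quotient = 0
    while x >= 2:       # on SBN this test is itself a subtract-and-branch
        x -= 2
        quotient += 1
    return quotient


value = 1_000_001
shifted = right_shift_by_subtraction(value)   # takes 500000 loop iterations
```

The iteration count equals the shifted value itself, so for an operand near $2^{256}$ the loop would run about $2^{255}$ times, which is why a dedicated shifter is used instead.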

To facilitate this in our architecture, we have introduced another flag ($SBN_{rs}$ and $SBN_{\overline{rs}}$) in our instruction format. When this flag is set, the dedicated right shifter module reads the operand and shifts it right by one bit position.

Shifting Key Register. As we have stated in Algorithm 1, the elliptic curve scalar multiplication operation involves point addition and point doubling operation. Point doubling happens for every key bit, but point addition happens only when the key bit value is one. Hence we need to scan the key value bit by bit to execute scalar multiplication operation. On a standard processor this can be implemented using shift and logical AND operation. However, executing logical operations using only SBN instruction is again time consuming and hence practically infeasible.

To solve this challenge, we use a dedicated key register, separate from the data memory shown in Fig. 2. We have also introduced another flag in our instruction format ($SBN_{ks}$ and $SBN_{\overline{ks}}$) which, when enabled, left shifts the key register by one bit. The bit shifted out of the key register decides whether a point addition is performed after the point doubling.

Multiplication Using SBN. Field multiplication using SBN is carried out by repeated addition. For example, to multiply operand A by operand B, we need to add operand A to an accumulator B times. We have already shown how to implement field addition using SBN in Table 2. To complete the multiplication, we need to run that code B times in a loop. In the worst case, operand values on the NIST P-256 curve are close to $2^{256}$, which makes repeated addition impractical, as the loop would need to run about $2^{256}$ times. Hence, we cannot implement field multiplication using only SBN for ECC scalar multiplication.

To solve this problem, we have designed a lightweight multiplier using DSP blocks, which acts as an external multiplier core and executes the field multiplication. However, to reset this multiplier core and to provide operand data to it, we need another variant of the SBN instruction, which we refer to as $SBN_{mul}$ and $SBN_{\overline{mul}}$. The $SBN_{mul}$ instruction resets the multiplier, whereas $SBN_{\overline{mul}}$ initiates the multiplication operation. The detailed description of this external multiplier core, along with its interfacing with the SBN-OISC processor, is provided in the next section.

In this section, we have discussed the variants of the SBN instruction that are required for an optimized ECC implementation. These variants are listed in Table 3, which combines the instruction variations discussed above. Note that when we reset the multiplier we do not need any memory write-back, as the ALU output does not matter in that situation. Similarly, no memory write-back is needed when we shift the key register or perform the right shift operation.

Table 3. Different variants of the SBN instruction


To support these variants of the SBN instruction in our architecture, we also need to modify the instruction format. The modified instruction format is shown in Fig. 3. In the following sections, we discuss the field multiplier architecture and the modified SBN architecture that supports these instruction variants. We stress that, although we introduce different variants of the SBN instruction, we still use the same ALU for each of these variants. Hence these variants are part of the same SBN instruction, with different flag values as shown in Table 3.

4 Lightweight Field Multiplier for SBN-OISC

As stated in the previous sections, the SBN-OISC processor needs a dedicated lightweight multiplier core for efficient execution of ECC operations. In this section, we focus on the architecture of this dedicated field multiplier and describe the design strategies behind it.

The architecture of the field multiplier is shown in Fig. 4. It requires two DSP blocks, one for integer multiplication and one for modular reduction. The DSP blocks of Virtex-5 FPGAs support $25 \times 18$ signed multiplication and also provide a 48 bit adder/accumulator. For our implementation, we use a DSP block as a $16 \times 16$ unsigned multiplier configured in multiply and accumulate mode. For addition operations, a DSP block is configured as a 32 bit adder.

We will first focus on the integer multiplier and will follow it with a discussion on the modular reduction operation.

4.1 Integer Multiplication

The integer multiplier receives two 256 bit operands as input. The operands are divided into 16 bit words and passed to the first DSP block through two multiplexers. The DSP block is configured in multiply and accumulate mode and supports two different operations. In the first operation, the DSP block computes $A*B+P$, where A and B are the two multiplexer outputs and P is the accumulator output. This operation sums the partial products that are aligned with each other. Let us illustrate this with the small example in Eq. 2.

Consider a 32 bit multiplication of two operands $R(=r_12^{16}+r_0)$ and $S(=s_12^{16}+s_0)$, each divided into 16 bit words. Here the partial products $r_1s_0$ and $r_0s_1$ are added by the operation $A*B+P$, as they are aligned with each other. For partial products that are shifted relative to the accumulated sum, the DSP block uses the second operation $A*B+C$, where $C=P>>16$. The result is stored in a memory of dimension $16\times 32$, implemented using a block RAM configured as a true dual port RAM.

$$R \times S = \sum_{j=0}^{1} r_j 2^{j*16} \times \sum_{i=0}^{1} s_i 2^{i*16} = r_0s_0 + (r_1s_0 + r_0s_1)2^{16} + r_1s_1 2^{32} \tag{2}$$

The integer multiplication requires 256 iterations of the DSP block, one for each pair of 16 bit words ($16 \times 16 = 256$), along with three clock cycles for updating the data memory. Hence the total clock cycle count for integer multiplication is 259.
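The word-serial multiply and accumulate scheme can be modelled in a few lines of Python. This is a behavioural sketch under assumptions, not the authors' hardware schedule: it processes the partial products column by column, models the accumulator as an unbounded integer, and uses the hypothetical names `split_words` and `mac_multiply`.

```python
WORD_BITS = 16
N_WORDS = 16                     # a 256-bit operand is 16 words of 16 bits
MASK = (1 << WORD_BITS) - 1


def split_words(x):
    """Split a 256-bit value into 16-bit words, least significant first."""
    return [(x >> (WORD_BITS * i)) & MASK for i in range(N_WORDS)]


def mac_multiply(r, s):
    """Column-wise schoolbook multiplication with one multiply-accumulate unit.
    Returns (product, number of MAC operations)."""
    rw, sw = split_words(r), split_words(s)
    acc = 0                      # models the DSP accumulator P
    out_words = []
    macs = 0
    for col in range(2 * N_WORDS - 1):
        lo = max(0, col - N_WORDS + 1)
        hi = min(col, N_WORDS - 1)
        for i in range(lo, hi + 1):
            acc += rw[i] * sw[col - i]   # aligned partial products: A*B + P
            macs += 1
        out_words.append(acc & MASK)     # low 16 bits form a finished result word
        acc >>= WORD_BITS                # carry into the next column: C = P >> 16
    out_words.append(acc & MASK)         # most significant word
    product = sum(w << (WORD_BITS * k) for k, w in enumerate(out_words))
    return product, macs
```

For two 256-bit operands the function performs exactly 256 multiply-accumulate steps, matching the iteration count given above, and produces 32 result words of 16 bits, which fill a $16 \times 32$ bit memory.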

4.2 Modular Reduction

Once the memory is loaded with the integer multiplication result, the modular reduction is initiated. For NIST curves, modular reduction requires a combination of additions and subtractions, as shown in Algorithm 2 in Appendix A. In Algorithm 2, the modular reduction adds the operands $T, S_1, S_2, S_3, S_4$ and subtracts $D_1, D_2, D_3, D_4$ from them. We split the operands into 32 bit words and use a DSP adder to execute the additions and subtractions. The memory produces 32 bits of output per clock cycle, which are added or subtracted depending on the control signal add/sub. Like the previous DSP block, this one also supports two operations: $P \pm C$ and $C+CONCAT$, where $CONCAT=P>>32$ and P is the accumulator output. The first operation adds a 32 bit operand to, or subtracts it from, the accumulator result, and the second operation adds the carry bits generated by the previous additions.

The sequence of additions and subtractions is determined by the modular reduction algorithm for the NIST P-256 curve, shown in Appendix A. The result produced is not fully reduced but lies within the range $[-4p, 5p]$ [3], where p is the modulus of the curve. This partial modular reduction requires 68 clock cycles, making the total clock cycle requirement for a field multiplication $259 + 68 = 327$.
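Since Appendix A is not reproduced in this section, the following Python sketch shows one way the partial reduction can be written. The word selections follow the standard NIST fast reduction for P-256 in its usual form $T + 2S_1 + 2S_2 + S_3 + S_4 - D_1 - D_2 - D_3 - D_4$; that this matches Algorithm 2 term for term is an assumption, and the helper names are hypothetical.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1


def words32(c):
    """Split a value below 2^512 into sixteen 32-bit words c[0..15]."""
    return [(c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]


def join(ws):
    """Build a 256-bit value from 8 words listed most significant first."""
    value = 0
    for w in ws:
        value = (value << 32) | w
    return value


def partial_reduce_p256(c_int):
    """Return T + 2*S1 + 2*S2 + S3 + S4 - D1 - D2 - D3 - D4.
    The result is congruent to c_int mod p but not fully reduced."""
    c = words32(c_int)
    t = join([c[7], c[6], c[5], c[4], c[3], c[2], c[1], c[0]])
    s1 = join([c[15], c[14], c[13], c[12], c[11], 0, 0, 0])
    s2 = join([0, c[15], c[14], c[13], c[12], 0, 0, 0])
    s3 = join([c[15], c[14], 0, 0, 0, c[10], c[9], c[8]])
    s4 = join([c[8], c[13], c[15], c[14], c[13], c[11], c[10], c[9]])
    d1 = join([c[10], c[8], 0, 0, 0, c[13], c[12], c[11]])
    d2 = join([c[11], c[9], 0, 0, c[15], c[14], c[13], c[12]])
    d3 = join([c[12], 0, c[10], c[9], c[8], c[15], c[14], c[13]])
    d4 = join([c[13], 0, c[11], c[10], c[9], 0, c[15], c[14]])
    return t + 2 * s1 + 2 * s2 + s3 + s4 - d1 - d2 - d3 - d4


def reduce_p256(c_int):
    """Full reduction; the final correction uses % here for brevity,
    where hardware would add or subtract a few multiples of p."""
    return partial_reduce_p256(c_int) % P256
```

The partial result may be negative, which is why the data path described earlier uses a signed representation.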

As shown in Fig. 4, our architecture is also coupled with a dedicated right shifter module. When the rs flag is set high, the design outputs the right shifted value of the input operand a. Otherwise, it outputs the result of the field multiplier.

5 Complete ECC SBN-OISC Processor

In this section, we present a detailed description of our proposed ECC SBN-OISC processor. The complete architecture of the processor is shown in Fig. 5. The architecture and operation of the proposed processor are similar to those of the stand-alone SBN processor shown in Fig. 2, with a few modifications, which are described below.

The ECC SBN-OISC processor is coupled with the multiplier core described in the previous section. The multiplier core is controlled by the mul flag of the instruction. As long as the mul flag is set to one, the multiplier stays in its initial state. Once the flag is set low, the multiplier starts its operation and produces the partially reduced output along with the signal web, which indicates the completion of the multiplication. In the stand-alone SBN processor (Fig. 2), the data memory is updated only through port A. In our case, we also use the otherwise unused port B to write the multiplier output into the memory. Note that when the rs flag is set high, the multiplier module produces the right shifted value of the input available through port A.

As mentioned earlier, we introduced a flag ks in our instruction format for shifting the key register. The key is stored in a separate register, which goes through a single bit left shift when the ks flag is set high. If the MSB of the key is one, we select the address of the memory location containing the value 1 (addr_1) and pass it to the data memory. Otherwise, if the MSB is zero, we select the address of the memory location containing the value 0 (addr_0). Once this is done, we can easily switch between point doubling and point addition depending on the memory location passed to the data memory.
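The key register behaviour can be modelled with a short Python sketch. The class `KeyRegister`, the function `select_address` and the concrete values of `ADDR_0` and `ADDR_1` are assumptions chosen for illustration; only the left shift and the selection between addr_0 and addr_1 come from the text.

```python
class KeyRegister:
    """Register that shifts left and exposes the bit shifted out."""

    def __init__(self, key, width=256):
        self.width = width
        self.value = key & ((1 << width) - 1)

    def shift_left(self):
        """Shift by one bit and return the MSB that leaves the register."""
        msb = (self.value >> (self.width - 1)) & 1
        self.value = (self.value << 1) & ((1 << self.width) - 1)
        return msb


ADDR_0, ADDR_1 = 0, 1     # hypothetical data-memory addresses holding 0 and 1


def select_address(key_reg):
    """Model of the ks flag: shift the key and pick addr_1 or addr_0."""
    return ADDR_1 if key_reg.shift_left() else ADDR_0


reg = KeyRegister(0b1011, width=4)
addresses = [select_address(reg) for _ in range(4)]   # [1, 0, 1, 1]
```

Because the selected location holds either 0 or 1, a following SBN instruction that reads it can branch on the key bit using the ordinary subtract-and-test path, without any dedicated logical instruction.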

The ALU of the proposed SBN-OISC processor is a subtractor implemented through cascaded DSP blocks. The subtraction requires 6 clock cycles. Instruction fetch, memory read and memory write-back require one clock cycle each. Hence a single SBN instruction takes $6 + 1 + 1 + 1 = 9$ clock cycles.

6 Result and Comparison

In this section, we analyze the performance of the proposed ECC SBN-OISC processor in terms of timing and area. Table 4 shows the timing and area performance of the proposed processor. The slice count required by the design on both Virtex-5 and Spartan-6 is very small. This is achieved by extensive use of block RAMs and DSP blocks. The stand-alone SBN processor is itself very lightweight, and the dedicated multiplier core is designed with judicious use of DSPs and block RAMs, which keeps the slice count extremely small. Block RAMs are used to implement both the data memory and the instruction memory of the SBN-OISC processor. Moreover, all temporary storage and the control units are also implemented with block RAMs, which increases the block RAM consumption but reduces the slice count considerably. A designer can choose a budget of slices and block RAMs and then design the ECC crypto-processor according to that budget. In this paper, we wanted to explore how far the slice count can be reduced by increasing the block RAM usage. The results in Table 4 show that the saving in slice usage is significant, and hence the objective of the paper is achieved.

Table 5 compares our design with previous results. Among the previous works, the design proposed in [3] targets a high speed architecture and is not intended for lightweight applications. Apart from that, the proposed ECC SBN-OISC processor shows comparable performance in terms of area-time product. It is, however, unfair to compare the proposed design directly with the previous designs [10, 11], as they were implemented on Virtex-II Pro, which is far less efficient than the Virtex-5 device family. As FPGA devices have evolved significantly in the last decade, design strategies need to be updated so that they are efficient on these modern FPGAs. Additionally, these old FPGA families are no longer recommended for new designs by Xilinx. Motivated by these reasons, we have chosen Virtex-5 and Spartan-6 as our implementation platforms, as these two FPGA families, though not the newest, are equipped with most of the hard-IPs present in modern FPGAs. The proposed processor is also much faster than lightweight software libraries for ECC such as TinyECC [25]. The developed architecture is the first implementation that reduces the slice requirement of an ECC processor to fewer than 100 on the Virtex-5 and Spartan-6 device families. The results shown here are obtained from post place and route analysis in Xilinx ISE.

Table 4. Area and timing performance of the proposed ECC SBN-OISC processor


Table 5. Comparison of ECC SBN-OISC processor with existing designs


7 Conclusion

In this paper, we have merged two design strategies to create an extremely lightweight ECC crypto-processor for scalar multiplication on the NIST P-256 curve. The first strategy is to use a single instruction processor (the ECC SBN-OISC processor) as a lightweight framework for ECC scalar multiplication. We equipped this processor with a dedicated field multiplier and made some simple modifications to the processor architecture and instruction format so that scalar multiplication completes in practical time. The second strategy is to use the dedicated hard-IPs of the FPGA to reduce the slice consumption further. We have shown that thorough use of DSP blocks and block RAMs decreases the slice requirement significantly. For Virtex-5 and Spartan-6, we achieve a slice consumption of fewer than 100. To the best of our knowledge, this is the first implementation to achieve this.

References

1. Daly, A., Marnane, W., Kerins, T., Popovici, E.: An FPGA implementation of a GF(p) ALU for encryption processors. Microprocess. Microsyst. 28(5–6), 253–260 (2004). Special Issue on FPGAs: Applications and Designs
2. Batina, L., Mentens, N., Sakiyama, K., Preneel, B., Verbauwhede, I.: Low-cost elliptic curve cryptography for wireless sensor networks. In: Buttyán, L., Gligor, V.D., Westhoff, D. (eds.) ESAS 2006. LNCS, vol. 4357, pp. 6–17. Springer, Heidelberg (2006)
3. Güneysu, T., Paar, C.: Ultra high performance ECC over NIST primes on commercial FPGAs. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154, pp. 62–78. Springer, Heidelberg (2008)
4. Satoh, A., Takano, K.: A scalable dual-field elliptic curve cryptographic processor. IEEE Trans. Comput. 52, 449–460 (2003)
5. Orlando, G., Paar, C.: A scalable GF(p) elliptic curve processor architecture for programmable hardware. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 356–371. Springer, Heidelberg (2001)
6. Alrimeih, H., Rakhmatov, D.: Pipelined modular multiplier supporting multiple standard prime fields. In: 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 48–56, June 2014
7. Roy, D.B., Mukhopadhyay, D., Izumi, M., Takahashi, J.: Tile before multiplication: an efficient strategy to optimize DSP multiplier for accelerating prime field ECC for NIST curves. In: The 51st Annual Design Automation Conference, DAC 2014, San Francisco, CA, USA, 1–5 June 2014, pp. 177:1–177:6 (2014)
8. Kim, C.-J., Yun, S.-Y., Park, S.-C.: A lightweight ECC algorithm for mobile RFID service. In: Proceedings of the 5th International Conference on Ubiquitous Information Technologies and Applications (CUTE 2010), pp. 1–6, December 2010
9. He, D., Kumar, N., Chilamkurti, N., Lee, J.-H.: Lightweight ECC based RFID authentication integrated with an ID verifier transfer protocol. J. Med. Syst. 38(10), 116 (2014)
10. Varchola, M., Güneysu, T., Mischke, O.: MicroECC: a lightweight reconfigurable elliptic curve crypto-processor. In: International Conference on Reconfigurable Computing and FPGAs, ReConFig 2011, Cancun, Mexico, November 30–December 2, 2011, pp. 204–210 (2011)
11. Vliegen, J., Mentens, N., Genoe, J., Braeken, A., Kubera, S., Touhafi, A., Verbauwhede, I.: A compact FPGA-based architecture for elliptic curve cryptography over prime fields. In: 21st IEEE International Conference on Application-Specific Systems Architectures and Processors, ASAP 2010, Rennes, France, 7–9 July 2010, pp. 313–316 (2010)
12. Tawalbeh, L.A., Mohammad, A., Gutub, A.A.-A.: Efficient FPGA implementation of a programmable architecture for GF(p) elliptic curve crypto computations. Signal Process. Syst. 59(3), 233–244 (2010)
13. Ghosh, S., Alam, M., Chowdhury, D.R., Gupta, I.S.: Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side channel attacks. Comput. Electr. Eng. 35(2), 329–338 (2009)
14. Xilinx Inc.: Virtex-II and Virtex-II Pro X FPGA User Guide, 14 February 2011
15. Driessen, B., Güneysu, T., Kavun, E.B., Mischke, O., Paar, C., Pöppelmann, T.: IPSecco: a lightweight and reconfigurable IPSec core. In: International Conference on Reconfigurable Computing and FPGAs, ReConFig 2012, Cancun, Mexico, 5–7 December 2012, pp. 1–7 (2012)
16. Pöpper, C., Mischke, O., Güneysu, T.: MicroACP - a fast and secure reconfigurable asymmetric crypto-processor. In: Goehringer, D., Santambrogio, M.D., Cardoso, J.M.P., Bertels, K. (eds.) ARC 2014. LNCS, vol. 8405, pp. 240–247. Springer, Heidelberg (2014)
17. Himmighofen, A., Jungk, B., Reith, S.: On a FPGA-based method for authentication using Edwards curves. In: 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), Darmstadt, Germany, 10–12 July 2013, pp. 1–7 (2013)
18. Fan, J., Batina, L., Verbauwhede, I.: Light-weight implementation options for curve-based cryptography: HECC is also ready for RFID. In: ICITST, pp. 1–6. IEEE (2009)
19. Kavun, E.B., Yalcin, T.: RAM-based ultra-lightweight FPGA implementation of PRESENT. In: International Conference on Reconfigurable Computing and FPGAs (ReConFig 2011), pp. 280–285, November 2011
20. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography. Springer, New York (2003)
21. Mavaddat, F., Parhami, B.: URISC: the ultimate reduced instruction set computer. Int. J. Electr. Eng. Educ. 25, 327–334 (1988)
22. Gilreath, W.F., Laplante, P.A.: Computer Architecture: A Minimalist Perspective. The Springer International Series in Engineering and Computer Science. Springer, New York (2003)
23. Naccache, D.: Is theoretical cryptography any good in practice? In: CHES (2010)
24. Tsoutsos, N.G., Maniatakos, M.: Investigating the application of one instruction set computing for encrypted data computation. In: Gierlichs, B., Guilley, S., Mukhopadhyay, D. (eds.) SPACE 2013. LNCS, vol. 8204, pp. 21–37. Springer, Heidelberg (2013)
25. Liu, A., Ning, P.: TinyECC: a configurable library for elliptic curve cryptography in wireless sensor networks. In: IPSN, pp. 245–256. IEEE Computer Society (2008)

Author information

Authors and Affiliations

Secured Embedded Architecture Laboratory (SEAL), Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
Debapriya Basu Roy, Poulami Das & Debdeep Mukhopadhyay

Corresponding author

Correspondence to Debapriya Basu Roy.

Editor information

Editors and Affiliations

University of Haifa, Haifa, Israel
Orr Dunkelman
Mount Allison University, Sackville, NB, Canada
Liam Keliher

A Appendix 1

Here we show two algorithms. The first, shown in Algorithm 1, performs ECC scalar multiplication using the double-and-add method.
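As an illustration of the double-and-add method, the following Python sketch computes kG on NIST P-256 by scanning the scalar from its most significant bit. It is a minimal sketch under stated assumptions rather than the processor's instruction sequence: it uses affine coordinates with a Fermat inversion, and the names `inv`, `point_add`, `scalar_mult` and `is_on_curve` are hypothetical helpers introduced only for this example. The curve constants are the standard P-256 domain parameters.

```python
# NIST P-256 domain parameters: y^2 = x^3 - 3x + B over GF(P), base point G of order N.
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
A = P - 3
B = 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B
GX = 0x6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296
GY = 0x4FE342E2FE1A7F9B8EE7EB4A7C0F9E162BCE33576B315ECECBB6406837BF51F5
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551
G = (GX, GY)


def inv(v):
    """Field inverse via Fermat's little theorem (assumption: v != 0 mod P)."""
    return pow(v % P, P - 2, P)


def is_on_curve(point):
    """Check the curve equation; None stands for the point at infinity."""
    if point is None:
        return True
    x, y = point
    return (y * y - (x * x * x + A * x + B)) % P == 0


def point_add(p1, p2):
    """Add two affine points, covering doubling and the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P  # tangent slope
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P  # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, point):
    """Left-to-right double-and-add: one doubling per bit, one addition per set bit."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)
        if bit == "1":
            result = point_add(result, point)
    return result
```

The loop performs an addition only when the scanned bit is 1, so its running time depends on the scalar; this sketch makes no attempt at side-channel resistance.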

Next, we present the fast modular reduction algorithm specified by NIST for the P-256 curve, shown in Algorithm 2.
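The fast reduction for P-256 can be sketched in Python as follows. The word arrangement follows the standard formulation for this prime (as given, for example, in [20]): the 512-bit product is split into sixteen 32-bit words, nine 256-bit terms are assembled from those words, and the result is s1 + 2s2 + 2s3 + s4 + s5 − s6 − s7 − s8 − s9 mod p. This is an illustrative software sketch, not the hardware datapath; `reduce_p256` and `_pack` are hypothetical names, and the final correction uses simple loops rather than a fixed number of conditional additions or subtractions.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
MASK32 = 0xFFFFFFFF


def _pack(words):
    """Assemble eight 32-bit words, most significant first, into one integer."""
    value = 0
    for w in words:
        value = (value << 32) | w
    return value


def reduce_p256(c):
    """Reduce 0 <= c < P256**2 modulo P256 using word shuffles, additions and subtractions."""
    if not 0 <= c < P256 * P256:
        raise ValueError("input must be a product of two reduced field elements")
    # w[i] is the i-th 32-bit word of c, with w[0] the least significant.
    w = [(c >> (32 * i)) & MASK32 for i in range(16)]
    s1 = _pack([w[7], w[6], w[5], w[4], w[3], w[2], w[1], w[0]])
    s2 = _pack([w[15], w[14], w[13], w[12], w[11], 0, 0, 0])
    s3 = _pack([0, w[15], w[14], w[13], w[12], 0, 0, 0])
    s4 = _pack([w[15], w[14], 0, 0, 0, w[10], w[9], w[8]])
    s5 = _pack([w[8], w[13], w[15], w[14], w[13], w[11], w[10], w[9]])
    s6 = _pack([w[10], w[8], 0, 0, 0, w[13], w[12], w[11]])
    s7 = _pack([w[11], w[9], 0, 0, w[15], w[14], w[13], w[12]])
    s8 = _pack([w[12], 0, w[10], w[9], w[8], w[15], w[14], w[13]])
    s9 = _pack([w[13], 0, w[11], w[10], w[9], 0, w[15], w[14]])
    r = s1 + 2 * s2 + 2 * s3 + s4 + s5 - s6 - s7 - s8 - s9
    # The signed sum lies within a few multiples of P256, so these loops run only a few times.
    while r < 0:
        r += P256
    while r >= P256:
        r -= P256
    return r
```

Because the reduction needs no division or general multiplication, it maps naturally onto a processor whose only native arithmetic is subtraction.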

About this paper

Cite this paper

Roy, D.B., Das, P., Mukhopadhyay, D. (2016). ECC on Your Fingertips: A Single Instruction Approach for Lightweight ECC Design in GF(p). In: Dunkelman, O., Keliher, L. (eds) Selected Areas in Cryptography – SAC 2015. SAC 2015. Lecture Notes in Computer Science, vol 9566. Springer, Cham. https://doi.org/10.1007/978-3-319-31301-6_9

DOI: https://doi.org/10.1007/978-3-319-31301-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31300-9
Online ISBN: 978-3-319-31301-6
eBook Packages: Computer Science, Computer Science (R0)
