1 Introduction

Low-cost cryptographic implementations have become increasingly attractive for modern applications. A low-cost implementation reduces the consumed resources and power, which is necessary in constrained scenarios. Although the available resources are strictly constrained, public-key cryptography (PKC) involving burdensome arithmetic is usually required due to its advantages over symmetric cryptography. Compared with RSA or other PKC algorithms over finite fields, elliptic curve cryptography (ECC) uses a much shorter key to achieve an equivalent level of security. Therefore, ECC implementations are preferred for resource-constrained applications owing to their lower computational complexity and other desirable properties such as reduced storage and power consumption. Compared with software and ASIC (Application Specific Integrated Circuit) implementations of ECC, an FPGA implementation offers a better trade-off between execution speed and development period. In addition, FPGAs are often used as early-validation platforms for ASICs, and thus have important research value.

ECC designs can be roughly split into two categories: those over (extended) binary fields \(\mathbb {F}_{2^n}\) and those over prime fields \(\mathbb {F}_p\); some designs support both. The first category offers better performance and lower resource consumption, mostly because no carry is propagated in the field arithmetic. That is why most area-optimized designs focus on binary field implementations. Nevertheless, prime fields also have significant value both in applications, such as digital signature generation, and in the elliptic curve standards [4, 5, 16]. The NIST curves over binary or prime fields have significant advantages in area and speed for hardware implementations, as fast reduction methods can be adopted and the parameters are fixed. Therefore, most of the low-area designs [1, 6, 8, 11, 18, 19, 21, 23] (especially in ASICs) focus on these standardized curves. These implementations are efficient for elliptic curve point multiplications (ECPMs), but they can hardly complete further applications (such as digital signature generation) on their own. For example, the Elliptic Curve Digital Signature Algorithm (ECDSA) needs operations over another prime field (defined by the order of the base point) for the final step, but fast reduction methods are not applicable to that prime, which has no special form. Existing implementations based on fast reduction have to add dedicated scheduling instructions and extend the hardware arithmetic unit to support signature generation, such as [18].

In this paper, to better support ECC-based applications in area-constrained scenarios (such as curve transition, or signature generation and key agreement, which require operations over another prime field), we propose a low-cost hardware ECC implementation for general prime fields based on Montgomery modular multiplication. Besides supporting operations over another prime field under the same curve, the implementation also supports switching between different curves (of the same length) without reconfiguration. For example, if needed, the user can switch the underlying curve from the NIST curve P-256 to the SM2 elliptic curve (Chinese ECC standard [16]) simply by writing new parameters.

The implementation is built on our proposed prime-field arithmetic logic unit (ALU) for modular arithmetic, which performs Montgomery modular multiplications, additions and subtractions with high compatibility. For the sake of efficiency, the ALU adopts a high-radix architecture. In particular, we employ Shift Register Look-up Tables (SRLs) in FPGAs to implement long registers and thus minimize the occupied area, and we eliminate the reduction operations between additions/subtractions and Montgomery multiplications to shrink the control circuits. Furthermore, we maximize the frequency of the ALU with the help of the dedicated DSP (Digital Signal Processing) blocks in modern FPGAs, which significantly improves the execution efficiency of the ECC implementation. At the higher level, we optimize the scheduling process in the point addition to improve the utilization of the ALU. Finally, we implement the design on a Spartan-6 FPGA platform. The design occupies only 105 Slices, 2 DSP blocks and 2 BRAMs in the Spartan-6 FPGA with a low computation latency. Comparison results indicate that our implementation outperforms existing prime-field FPGA designs in terms of area, saving a large number of logic Slices or dedicated cores.

The rest of this paper is organized as follows. Section 2 presents the preliminaries for elliptic curve cryptography. Section 3 describes the hardware architectures of the ALU and the ECC processor. Section 4 gives implementation results in FPGAs. Section 5 presents the comparison results with related work. Section 6 concludes the paper.

2 Elliptic Curve Cryptography

An elliptic curve over a field \(\mathbb {K}\) is defined by the Weierstrass equation:

$$\begin{aligned} E: y^2 + a_1 xy +a_3y = x^3 + a_2x^2 + a_4x + a_6. \end{aligned}$$
(1)

In the case of the characteristic \(Char(\mathbb {K}) \ne 2,3\), the general Weierstrass equation is simplified to

$$\begin{aligned} E: y^2 = x^3 + ax + b. \end{aligned}$$
(2)

In [14], Montgomery developed an original technique to compute multiples of points on an elliptic curve, as shown in Algorithm 1. His technique is based on the fact that the sum of two points whose difference is a known point can be computed without the y-coordinates of the two points. In each iteration of the Montgomery ladder algorithm, \(Q_1-Q_0\) always equals the base point P, and thus can be preset as a constant in advance. As speed is not the primary optimization goal, we do not adopt more efficient algorithms such as NAF (Non-adjacent Form) or window-based algorithms, but employ the Montgomery ladder algorithm, whose control is simpler. In addition, the method is resistant against Simple Power Analysis (SPA).

[Algorithm 1: Montgomery ladder for elliptic curve point multiplication]
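The ladder structure described above can be sketched as follows. This is a minimal illustration, not the authors' listing of Algorithm 1: the function name `montgomery_ladder` and the `add`, `double` and `identity` parameters are assumptions introduced here so that the same loop can be run over any additive group.

```python
def montgomery_ladder(k, P, add, double, identity):
    """Return (kP, (k+1)P) using one addition and one doubling per scalar bit.

    Invariant after every iteration: Q1 - Q0 == P.
    """
    Q0, Q1 = identity, P
    for bit in bin(k)[2:]:                  # scan k from the most significant bit
        if bit == "1":
            Q0, Q1 = add(Q0, Q1), double(Q1)
        else:
            Q0, Q1 = double(Q0), add(Q0, Q1)
    return Q0, Q1


# Toy check in the additive group of integers, where kP is ordinary multiplication.
q0, q1 = montgomery_ladder(45, 7, lambda x, y: x + y, lambda x: 2 * x, 0)
```

Both branches execute the same pair of operations, which is the property behind the simple control and the SPA resistance mentioned above.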

Let \(P = (x_1, y_1)\) and \(Q = (x_2, y_2)\in E(\mathbb {F}_q)\) with \(P\ne \pm Q\), and \(P+Q = (x_3, y_3)\), \(2P = (x_4, y_4)\). Given the point \(P-Q = (x', y')\), the x-coordinates of \(P + Q\) and 2P satisfy [2]:

$$\begin{aligned} x_3 = \frac{2(x_1+x_2)(x_1x_2+a)+4b}{(x_1-x_2)^2}-x', \end{aligned}$$
(3)

and

$$\begin{aligned} x_4 = \frac{(x_1^2-a)^2-8bx_1}{4(x_1^3+ax_1+b)}. \end{aligned}$$
(4)
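As a sketch of where Eq. (3) comes from, the sum \(x_3+x'\) can be derived from the usual chord slopes together with the curve equation \(y_i^2=x_i^3+ax_i+b\). The symbols \(\lambda _{+}\) and \(\lambda _{-}\) are introduced here only for illustration:

$$\begin{aligned} \lambda _{+}&= \frac{y_2-y_1}{x_2-x_1}, \qquad \lambda _{-} = \frac{-y_2-y_1}{x_2-x_1}, \\ x_3&= \lambda _{+}^2-x_1-x_2, \qquad x' = \lambda _{-}^2-x_1-x_2, \\ x_3+x'&= \frac{2(y_1^2+y_2^2)}{(x_1-x_2)^2}-2(x_1+x_2) \\&= \frac{2(x_1^3+x_2^3)+2a(x_1+x_2)+4b-2(x_1+x_2)(x_1-x_2)^2}{(x_1-x_2)^2} \\&= \frac{2(x_1+x_2)(x_1x_2+a)+4b}{(x_1-x_2)^2}. \end{aligned}$$

The y-coordinates enter only through \(y_1^2+y_2^2\), which is why the result depends on the x-coordinates alone.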

The formulas for point addition and point doubling require a field inversion and several field multiplications. If inversion in \(\mathbb {K}\) is significantly more expensive than multiplication, then it may be advantageous to represent points using projective coordinates.

The projective point \((X : Y : Z), Z\ne 0\), corresponds to the affine point (X / Z, Y / Z). The projective equation of the elliptic curve is

$$Y^2Z = X^3+aX Z^2 +bZ^3.$$

The point at infinity \(\infty \) corresponds to (0 : 1 : 0), while the negative of (X : Y : Z) is \((X : -Y : Z)\). In standard projective coordinates, Eqs. (3) and (4) become [10]:

$$\begin{aligned} Z_3= & {} (X_1Z_2 - X_2Z_1)^2, \end{aligned}$$
(5)
$$\begin{aligned} X_3= & {} 2(X_1Z_2+X_2Z_1)(X_1X_2+aZ_1Z_2)+4bZ_1^2Z_2^2-x'Z_3, \end{aligned}$$
(6)
$$\begin{aligned} Z_4= & {} 4Z_1(X_1^3 + aX_1Z_1^2 + bZ_1^3), \end{aligned}$$
(7)
$$\begin{aligned} X_4= & {} (X_1^2 - aZ_1^2)^2 - 8bX_1Z_1^3. \end{aligned}$$
(8)
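The projective x-only formulae can be checked numerically with a short sketch. It uses the forms obtained by homogenizing Eqs. (3) and (4), namely \(Z_3=(X_1Z_2-X_2Z_1)^2\) and \(X_4=(X_1^2-aZ_1^2)^2-8bX_1Z_1^3\). The toy curve over \(\mathbb {F}_{101}\) and all function names are assumptions chosen for illustration only.

```python
p, a, b = 101, 3, 7     # toy curve y^2 = x^3 + 3x + 7 over F_101 (illustration only)

def inv(v):
    return pow(v % p, p - 2, p)

def affine_add(P, Q):
    """Reference chord-and-tangent addition; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * inv(2 * y1) % p
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def x_add_proj(X1, Z1, X2, Z2, xd):
    """x-only addition; xd is the affine x-coordinate of the difference."""
    Z3 = (X1 * Z2 - X2 * Z1) ** 2 % p
    X3 = (2 * (X1 * Z2 + X2 * Z1) * (X1 * X2 + a * Z1 * Z2)
          + 4 * b * Z1 * Z1 * Z2 * Z2 - xd * Z3) % p
    return X3, Z3

def x_dbl_proj(X1, Z1):
    """x-only doubling."""
    Z4 = 4 * Z1 * (X1 ** 3 + a * X1 * Z1 ** 2 + b * Z1 ** 3) % p
    X4 = ((X1 * X1 - a * Z1 * Z1) ** 2 - 8 * b * X1 * Z1 ** 3) % p
    return X4, Z4

def ladder_x(k, xP):
    """Affine x-coordinate of kP (k >= 1, kP and (k+1)P finite)."""
    Q0, Q1 = (xP, 1), x_dbl_proj(xP, 1)
    for bit in bin(k)[3:]:                   # bits after the leading 1
        if bit == "1":
            Q0, Q1 = x_add_proj(*Q0, *Q1, xP), x_dbl_proj(*Q1)
        else:
            Q0, Q1 = x_dbl_proj(*Q0), x_add_proj(*Q0, *Q1, xP)
    X, Z = Q0
    return X * inv(Z) % p                    # single inversion at the very end
```

Only one inversion is needed, at the end, which is the motivation for the projective representation.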

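In some cases (such as public key generation), the y-coordinate is required. When the difference is taken as \(Q-P=(x', y')\), which matches \(Q_1-Q_0=P\) in Algorithm 1 with \(Q_0\) in the role of P here, the y-coordinate of P can be deduced by: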

$$y_1 = \frac{2b+(a+x'x_1)(x'+x_1)-x_2(x'-x_1)^2}{2y'}.$$

Note that \(y'\) is the y-coordinate of the base point that is known in advance in Algorithm 1, so \((2y')^{-1}\) can be pre-computed and the inversion is replaced by a multiplication in the equation.
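The recovery formula can be exercised on a toy curve as follows. This is a sketch under stated assumptions: the curve over \(\mathbb {F}_{101}\), the helper names, and the sign convention \(x_2 = x(P + \text{base})\) are choices made here for illustration. The inverse \((2y')^{-1}\) is computed once, as the text suggests.

```python
p, a, b = 101, 3, 7     # toy curve y^2 = x^3 + 3x + 7 over F_101 (illustration only)

def inv(v):
    return pow(v % p, p - 2, p)

def chord_add(P, Q):
    """Affine addition of two points with different x-coordinates."""
    (x1, y1), (x2, y2) = P, Q
    lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def make_recover_y(base):
    """Build a y-recovery function for a fixed base point (x', y')."""
    xb, yb = base
    inv_2yb = inv(2 * yb)                    # (2y')^{-1}, pre-computed once

    def recover_y(x1, x2):
        # x1 = x(P), x2 = x(P + base); returns y(P)
        num = 2 * b + (a + xb * x1) * (xb + x1) - x2 * (xb - x1) ** 2
        return num * inv_2yb % p

    return recover_y
```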

In the original ladder, the addition and doubling are computed separately. For the sake of efficiency, Izu et al. [9] encapsulated these formulae into one formula, which outputs the x-coordinates of \(P+Q\) and 2P on inputs P and Q. With a projective version of the x-coordinate-only formulae, \(X_3,Z_3,X_4,Z_4\) can be computed with 17 multiplications and 18 additions, using 7 auxiliary variables. The y-coordinate recovery algorithm requires 13 multiplications, 7 additions and 7 auxiliary variables [9]. In one ECPM, the y-coordinate recovery is performed only once, and thus has negligible impact on the execution efficiency.

3 Hardware Architecture

A compact ECC architecture is usually composed of four parts: the ALU for finite-field arithmetic, data memory, program memory and other control circuits. The core of the design is to simplify the ALU operations and to utilize the ALU efficiently for elliptic curve arithmetic. In this section, we first propose a compact and efficient ALU architecture based on a series of design policies, and then design a low-cost ECC architecture.

3.1 ALU Design

Design Policy. The design of an ALU is crucial for ECC implementations. The purpose of our design is to construct a compact and area-saving architecture. To achieve this goal, we establish the following design policy for the circuit architecture.

  1. To support the operations of general prime fields or generic curves rather than specific curves (such as NIST curves) in order to guarantee the flexibility.

  2. To set the bus width of the input/output signals to no more than 16 bits in order to simplify the control, and to use single-port RAM rather than dual-port RAM.

  3. To integrate all the prime-field operations to use the same computing circuit in order to save consumed resources.

  4. To specifically enrich the ALU functions to optimize the scheduling process inside the elliptic curve arithmetic.

Item (1) ensures the flexibility of the ECC implementation, so that it is suitable not only for the NIST primes (such as P-192, P-256) but also for operations over general prime fields. Item (2) guarantees lower resource consumption when using RAMs: with a narrower width and a single-port RAM, the control circuits occupy less area. The use of single-port RAM also improves the portability of the architecture. Item (3) makes the same circuit perform modular multiplication, addition and subtraction by configuration, which improves the utilization. Item (4) enhances the efficiency of point addition and doubling, as some of the specific operations involved can be executed optimally by the ALU.

Modular Arithmetic. According to the design policy (1), we choose Montgomery multiplication as the underlying modular multiplication algorithm, rather than the fast reduction method which is only available for pseudo Mersenne primes such as NIST standardized primes [5]. Montgomery multiplication is a method to perform modular multiplication without the need to perform division by the modulus [13]. A version of Montgomery’s algorithm [17] is given as Algorithm 2. This algorithm avoids multiplication and addition in quotient determination to simplify the computation.

[Algorithm 2: Montgomery multiplication with simplified quotient determination]
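Since the listing of Algorithm 2 is not reproduced here, the following is a hedged reconstruction from the surrounding text, assuming the common variant in which \(\widetilde{M}\equiv -1 \pmod {2^w}\) and \(M'=(\widetilde{M}+1)/2^w\), so that the quotient digit is simply the low word of \(S_i\). The function names and the example modulus used in the check are assumptions, not the authors' code.

```python
def mont_setup(M, w, n):
    """Derive M~ and M' for an odd modulus M (symbol names follow the text)."""
    m_inv = (-pow(M, -1, 1 << w)) % (1 << w)   # -M^{-1} mod 2^w (Python 3.8+)
    M_tilde = m_inv * M                        # multiple of M with M~ = -1 mod 2^w
    M_prime = (M_tilde + 1) >> w               # M' = (M~ + 1) / 2^w
    assert (1 << (w * n)) > 16 * M_tilde       # R = 2^(wn) > 16 M~
    return M_tilde, M_prime

def mont_mul(A, B, M_prime, w, n):
    """Return S with S = A * B * R^{-1} (mod M), R = 2^(wn), in n + 1 rounds."""
    mask = (1 << w) - 1
    S = 0
    for i in range(n + 1):
        q = S & mask                           # q_i = S_i mod 2^w: no multiplication needed
        b_i = (B >> (w * i)) & mask            # b_n is 0 because B < 2^(wn)
        S = (S >> w) + q * M_prime + b_i * A   # S_{i+1}
    return S
```

Each round equals \((S_i+q_i\widetilde{M})/2^w+b_iA\) exactly, because \(S_i+q_i\widetilde{M}\) has a zero low word; this is what removes the multiplication and addition from quotient determination.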

As Algorithm 2 shows, the step that calculates \(S_{i+1}\) is crucial for this modular multiplication. To make the algorithm suitable for hardware implementation, we propose a processing method that uses two w-bit multipliers and a few adders, as shown in Algorithm 3. The long integers \(S_i\) and \(M'\) are split into w-bit blocks. The remaining inputs appearing in Algorithm 2 are omitted.

[Algorithm 3: w-bit block processing method for computing \(S_{i+1}\)]

In Algorithm 3, \(S_i\) and \(M'\) are divided into n w-bit blocks. \(S_i\) is represented as: \(S_{i} = \{S_{i}^{(n-1)},S_{i}^{(n-2)},\ldots ,S_{i}^{(0)}\}.\) During the initialization procedure, \(S_{0}\) and the \((w+1)\)-bit Carry are set to zero. After the initialization, two nested loops complete the Montgomery multiplication. For the operands A and B, the outer loop steps through the blocks of B and the inner loop controls the use of the w-bit blocks \(a_{j}\). The most resource-consuming calculation is in the inner loop, which involves two w-bit multipliers and an addition with four addends. Since the inner loop takes n clock cycles and the outer loop runs \((n+1)\) times, one Montgomery multiplication takes \(n(n+1)\) clock cycles.
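The word-serial processing described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' Algorithm 3: in particular, keeping the leftover carry of each round as an extra top word `S[n]` is an assumption made here, and `to_words` and `mont_mul_words` are hypothetical names.

```python
def to_words(x, w, n):
    """Split x into n words of w bits, least significant first."""
    return [(x >> (w * j)) & ((1 << w) - 1) for j in range(n)]

def mont_mul_words(a, b, m_prime, w, n):
    """Word-serial Montgomery multiplication; returns (S, clock_cycles)."""
    mask = (1 << w) - 1
    S = [0] * (n + 1)                 # S[n] keeps the carry left over from the previous round
    cycles = 0
    for i in range(n + 1):            # outer loop: one block b_i per round (b_n = 0)
        b_i = b[i] if i < n else 0
        q = S[0]                      # q_i = S_i mod 2^w
        carry = 0
        for j in range(n):            # inner loop: one block a_j per clock cycle
            # two w-bit products and a four-operand addition
            t = carry + S[j + 1] + q * m_prime[j] + a[j] * b_i
            S[j] = t & mask           # low w bits go to the S shift register
            carry = t >> w            # upper bits are fed back to the adder
            cycles += 1
        S[n] = carry
    return sum(s << (w * j) for j, s in enumerate(S)), cycles
```

For \(w=16\) and \(n=18\) the loop body runs \(n(n+1)=342\) times, matching the cycle count stated above.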

Based on the processing method, we add the modular addition and subtraction functions by reusing the existing accumulator in the multiplier. In fact, the Montgomery multiplication itself performs a reduction when R is expanded. For example, for any inputs \(A,B<4\widetilde{M}\), when \(R=2^{wn}>4\times 4\widetilde{M}=16\widetilde{M}\), the final result still satisfies \(0<S_{n+1}<2\widetilde{M}.\) Therefore, we do not reduce the addition results, but only guarantee that the multiplier inputs lie in the range \((0,4\widetilde{M})\). For modular subtraction, we need to add integer multiples of M to the subtraction result when it is negative. The algorithm for modular addition and subtraction is shown in Algorithm 4.

[Algorithm 4: modular addition and subtraction]

In the initialization of Algorithm 4, \(S_{0}\) and the \((w+1)\)-bit Carry are set to zero. Depending on the operation flag sub, the accumulator performs either addition or subtraction. For modular addition, the operation is completed after n loop iterations. For modular subtraction, the XOR operation is activated and ‘1’ is added in the first round. Furthermore, if Carry is not zero after the iteration, which means the final result S is negative, extra additions with \(\widetilde{M}\) are required until S is positive.
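At the integer level, the lazy addition and subtraction can be sketched as below. This is a simplified model, not the word-serial Algorithm 4: the function names and the `width` parameter are assumptions, and the signed interpretation of the sum stands in for the hardware Carry check.

```python
def lazy_add(A, B):
    """Unreduced sum: for A, B < 2*M_tilde the result stays below 4*M_tilde."""
    return A + B

def lazy_sub(A, B, M_tilde, width):
    """A - B via XOR with all ones plus 1, then add M_tilde while negative."""
    ones = (1 << width) - 1
    S = A + (B ^ ones) + 1            # two's complement: A + ~B + 1
    S -= 1 << width                   # read the width-bit sum as a signed value
    while S < 0:                      # negative result: add M~ until non-negative
        S += M_tilde
    return S
```

With inputs below \(2\widetilde{M}\), as produced by the Montgomery multiplier, both results remain below \(4\widetilde{M}\) and can be fed back to the multiplier without a separate reduction.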

Design Architecture. The ALU architecture integrating the Montgomery multiplication, addition and subtraction operations is depicted in Fig. 1. According to design policy (2), only one data input port is allowed for the ALU. There are three logic calculation units: two w-bit multipliers and one adder with four inputs. In addition, two long shift registers store \(M'\) and the computed result S, and both are shifted by w bits every clock cycle. Here, we carefully minimize the number of states of the shift registers to reduce the control circuits for these large registers.

The data flow of the multiplication in the ALU is as follows. Before the first modular multiplication, the modulus is loaded into the shift register \(M^{\prime }\). Since the bus width of the data input port is w, the loading takes about \((n-1)\) shifts. Note that the loading operation is executed only once, at the beginning of the prime field computation. Algorithm 3 starts after the rightmost w-bit block \(b_{0}\) is loaded into the ALU register. Then \(a_{0}\), \(a_{1}\), \(\ldots \), \(a_{n-1}\) successively enter the left multiplier unit of the ALU in Fig. 1. The addition result is \(2w+1\) bits wide: the upper \(w+1\) bits are fed back to the adder in the next cycle and the rest are put into the S shift register. Due to the one extra cycle for loading \(b_i\), the total number of clock cycles for one multiplication increases to \((n+1)^2\).

Fig. 1. The ALU architecture

The addition and subtraction share the same accumulator of the ALU. The configuration is easily completed by the multiplexers that switch the values of sub and zero. Another advantage of the ALU structure is that it can perform the operation \(\alpha A \pm \beta B\), where \(\alpha ,\beta \in [0,2^{w}-1] \) and \(\beta \equiv 1\) for subtraction. The function becomes more powerful when combined with Algorithm 4, as the result can be immediately input to the ALU for Montgomery multiplication without intermediate modular reduction. This is useful for merging adjacent additions and subtractions in the point addition and doubling algorithms, which saves program commands and execution time. One addition/subtraction requires \(2(n+1)\) clock cycles.

3.2 FPGA Optimization

In modern FPGAs, the dedicated resources and multifunctional logic allow us to further improve the efficiency of the ALU.

Maximizing Frequency. The maximum frequency of the ALU is limited by the long critical path, which consists of one multiplier, one four-operand adder and some multiplexers, as shown in Fig. 1. This could reduce the compatibility of the ALU (or the ECC implementation) with high-speed modules that run at a high frequency in the same FPGA. Therefore, we maximize the frequency of the ALU with the help of the dedicated DSP (Digital Signal Processing) blocks in modern FPGAs, which also significantly improves the execution efficiency of the ECC implementation. In our targeted FPGA platform, the Xilinx Spartan-6, the DSP block named DSP48A1 can perform (mixed) multiplications and subtractions/additions efficiently, and a very high frequency can be reached by inserting registers (i.e., pipeline stages) inside the DSP blocks.

When setting up the pipeline, to guarantee that one result is produced in each loop iteration once the pipeline is filled, we must identify which variables are known in advance and which are not. The critical data path of the ALU in the j-th inner-loop iteration of round i is represented as:

$$\{Carry,S_{i+1}^{(j)}\} = Carry + S_{i}^{(j+1)} + q_i*m_j' + a_j*b_i,$$

where all variables except Carry for the next iteration are known before the end of the current iteration, and thus can be pre-computed. In addition, the result of \(a_j*b_i\) should be negated for the subtraction operation (\(b_i\) always equals 1 for the subtraction), and we also put the negation into the DSP blocks by subtracting the product from the w-bit all-ones constant.

The data path with pipelines inside the DSP blocks is depicted in Fig. 2, which contains two DSP blocks with a three-stage pipeline. The stage latencies are balanced to maximize the frequency. In the 1st stage, the two multiplications are performed. The negation and the addition operations are processed in the 2nd stage, and the two DSP blocks are connected through the cascade ports PCOUT and PCIN to decrease the wire delay. The remaining three-operand addition is performed in the 3rd stage. After the frequency optimization, the maximum frequency of the ALU is improved to 200 MHz in our experiment, which is nearly three times the original. The execution time of a multiplication or addition/subtraction slightly increases by two clock cycles, which are needed to fill the pipeline.

Fig. 2. The optimized data path with pipelines

Minimizing Area. In the ALU, the consumed resources are mainly divided into three parts: the multiplier and adder units, the long shift registers, and the control circuit. We have already put some addition operations into the DSP blocks to improve the resource utilization. Here we further employ the SRLs in FPGAs to implement the long shift registers. SRLs are one type of LUT that can efficiently implement shift registers. One 6-input SRL can form a shift register of 1-bit width and 64-bit depth, which would require 64 registers without this optimization. In addition, as mentioned above, the two groups of shift registers (S and M in Fig. 1) have only two states, suspending and shifting, except for the head registers, which have different inputs. In SRL implementations, each SRL has an input that controls whether the internal registers shift or not. Therefore, these registers can be implemented in FPGAs at an ultra-low cost.

3.3 ECC Architecture

Based on the ALU design and the Montgomery ladder algorithm for point multiplication, we design our ECC architecture by utilizing decoders and program commands to control all the required operations over prime fields. The ECC architecture is shown in Fig. 3. Besides the ALU module, two memories are deployed in the architecture. One is the PROGRAM ROM and the other is a DATA RAM, which stores intermediate values, constants, and the final results.

Fig. 3. The ECC architecture

Point multiplication is composed of a series of point additions and point doublings. After modular multiplications and additions are merged by means of the multiply-and-add function of the ALU, one combined point addition and doubling needs 17 modular multiplications and 12 modular additions/subtractions. For each modular multiplication or modular addition/subtraction, there is a command that indicates the type of the operation and the addresses of its operands. Therefore, one combined point addition and doubling takes 29 commands in total, and a decoder determines which command is used for the next calculation according to k. To simplify the decoder, we store two sets of commands in the PROGRAM ROM: one set of 29 commands for the case \(k_{i}=0\), and the other set of 29 commands for the case \(k_{i}=1\).
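The command selection described above can be sketched as a simple address generator. The ROM layout used here (the \(k_i=0\) set at addresses 0 to 28, the \(k_i=1\) set directly after it) is a hypothetical assumption, as are the function names; the text does not specify the actual encoding.

```python
COMMANDS_PER_STEP = 17 + 12          # 17 multiplications + 12 additions/subtractions

def command_address(k_bit, step):
    """Hypothetical ROM address of command `step` in the set selected by k_bit."""
    return k_bit * COMMANDS_PER_STEP + step

def schedule(k, bits):
    """Yield the ROM addresses fetched while scanning k from the top bit down."""
    for i in reversed(range(bits)):
        k_i = (k >> i) & 1
        for step in range(COMMANDS_PER_STEP):
            yield command_address(k_i, step)
```

Under this assumed layout the two sets occupy 58 entries, and the decoder only needs the current bit \(k_i\) and a step counter.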

Modular inversion is required for transferring the projective coordinates to affine ones or generating the signature. In our implementation, the operation is carried out by utilizing the ALU, and the commands are also stored in the ROM of Fig. 3. We take advantage of Fermat’s little theorem to calculate the inversion, which is

$$\begin{aligned} a^{p-2} \equiv a^{-1} \bmod p. \end{aligned}$$
(9)

Hence, modular inversion costs only a little extra memory and control logic.
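Eq. (9) can be evaluated with a plain square-and-multiply loop, as in the following sketch. The function name and the `modmul` parameter are assumptions made for illustration; in the architecture the multiplications would be Montgomery multiplications issued as ROM commands, whereas the sketch uses ordinary modular multiplication.

```python
def fermat_inverse(a, p, modmul=lambda x, y, m: x * y % m):
    """Return (a^(p-2) mod p, number of modular multiplications used)."""
    result, mults = 1, 0
    for bit in bin(p - 2)[2:]:            # left-to-right over the exponent bits
        result = modmul(result, result, p)
        mults += 1
        if bit == "1":
            result = modmul(result, a, p)
            mults += 1
    return result, mults
```

For a 256-bit prime the loop uses at most 512 multiplications, which is small compared with the 256 ladder steps of 17 multiplications each.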

4 Hardware Implementation

In this section, we implement the 256-bit ECC architecture in Xilinx FPGAs, and evaluate its area and efficiency.

4.1 Efficiency

For demonstration purposes, we implement 256-bit ECPM of generic curves over prime fields in FPGAs. The parameters of the ALU and ECC processor are set as follows. The modulus M is 256 bits long, the width \(w=16\), \(n=18\), and \(R=2^{wn}=2^{288}\). Hence, the total numbers of processing cycles for a modular multiplication and an addition/subtraction are \((n+1)^2+2 = 363\) and \(2(n+1)+2 = 40\), respectively. The RAM is 256 words deep and 16 bits wide (256\(\,\times \,\)16), and the program ROM size is 128\(\,\times \,\)16.

For 256-bit ECPM, the required clock cycles of the different operations for \(w=16\) are listed in Table 1. One 256-bit ECPM has at most 256 combined point addition and doubling operations in the Montgomery ladder algorithm. The inversion in the ECPM is performed only once. The y-coordinate is also calculated at the end of the ECPM. As a result, the 256-bit ECPM for generic curves can be completed in about 1847 K clock cycles for \(w=16\).
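The ladder portion of this cycle count can be reproduced from the figures given in the text with a small model. The function names are assumptions, and the model covers only the 256 ladder steps, not the inversion, the y-coordinate recovery or any control overhead.

```python
def op_cycles(n, pipeline_fill=2):
    """Cycles of one modular multiplication and one addition/subtraction."""
    mul = (n + 1) ** 2 + pipeline_fill
    addsub = 2 * (n + 1) + pipeline_fill
    return mul, addsub

def ladder_cycles(bits, n, muls_per_step=17, addsubs_per_step=12):
    """Cycles spent in `bits` combined point addition and doubling steps."""
    mul, addsub = op_cycles(n)
    return bits * (muls_per_step * mul + addsubs_per_step * addsub)
```

For \(n=18\) this gives 363 and 40 cycles per operation and \(256\times (17\times 363+12\times 40)=1{,}702{,}656\) cycles for the ladder, so roughly 144 K of the reported 1847 K cycles remain for the final inversion, the y-coordinate recovery and other overhead.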

Table 1. Required clock cycles for the operations (\(w=16\))

4.2 FPGA Implementation

We implement the ECC processor on Xilinx Spartan-6 (XC6SLX45T-4) and Spartan-3E (XC3S100E-5) FPGAs. Both FPGAs are oriented to low-cost applications, while the latter is an older device that allows a fair comparison with previous work. The post-place-and-route (PAR) implementation results from ISE 14.6 are shown in Table 2. Note that the differences between the internal structures of the two FPGAs make the consumed logic resources appear significantly different. One Slice in Spartan-3E contains two 4-input LUTs and two flip-flops, while a Slice in Spartan-6 contains four 6-input LUTs and four flip-flops. The multiplier resources are multi-functional DSP units in Spartan-6, while they are multiplier hardcores (MULT) in Spartan-3E. Note also that the functions completed in the DSP blocks on Spartan-6 are implemented using LUT logic on Spartan-3E.

Table 2. PAR results of the ECC processor in FPGAs

On the Spartan-3E FPGA, our ECC architecture with \(w=16\) occupies 350 Slices (including 520 LUTs and 392 flip-flops), 2 MULTs, and 2 BRAMs. The maximum frequency on Spartan-3E is 145.6 MHz, and one ECPM takes 12.7 ms. On the Spartan-6 FPGA, the design occupies only 105 Slices (including 332 6-input LUTs and 382 flip-flops), 2 DSP blocks and 2 BRAMs. One ECPM takes only 9.2 ms with \(w=16\) on Spartan-6. These results show that our low-cost ECC architecture performs well in FPGAs: it has a very small area and a low computation latency.
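As a consistency check, the reported latencies follow from the cycle count of about 1847 K given in Sect. 4.1 divided by the clock frequency. The Spartan-6 frequency of 200 MHz used below is taken from Sect. 3.2 and is an assumption here, since the exact figure belongs to Table 2:

$$\begin{aligned} t_{\text {Spartan-3E}}&\approx \frac{1847\times 10^{3}}{145.6\times 10^{6}\,\text {Hz}} \approx 12.7\,\text {ms}, \\ t_{\text {Spartan-6}}&\approx \frac{1847\times 10^{3}}{200\times 10^{6}\,\text {Hz}} \approx 9.2\,\text {ms}. \end{aligned}$$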

5 Related Work and Comparison

Many low-cost ECC implementations target ASICs. Because the area is extremely constrained and the application is specific, the ASIC implementations use standardized curves that support fast reduction, such as [1, 6, 8, 11, 18, 23]. Unlike these designs, we choose Montgomery multiplication as the modular multiplication method, which is more complex than fast reduction but more flexible. This allows us to support generic curves or different prime fields under the same curve without reconfiguration. Hence, we do not compare our design with ASIC implementations, owing to the differences in the underlying platform and the design goal.

Table 3. Comparison of ECC implementations in FPGAs (256-bit prime field)

Table 3 lists related work on ECC implementations in FPGAs, where the first three [19, 21, 22] focus on compact implementations and the others [3, 7, 12, 15, 20, 24] on high-speed implementations. In particular, [7, 19, 21] optimized the implementation efficiency for the NIST prime P-256. Vliegen et al. [22] presented a compact FPGA-based architecture for ECC over prime fields using the coarsely integrated operand scanning (CIOS) method of Montgomery multiplication. The architecture works for any prime-field curve, but the occupied area is large. Varchola et al. [21] optimized the computational unit using fast reduction for the NIST primes, and obtained a very high frequency and a relatively small area (773 Slices in Virtex-2 Pro). Targeting generic curves, our optimized ECC architecture has a significantly lower area than these two works. As Virtex-2 family FPGAs are no longer recommended by Xilinx, we choose the Spartan-3E FPGA, which has a structure similar to that of Virtex-2 Pro, for the comparison. Our implementation consumes only 350 Slices with 2 MULTs and 2 BRAMs, so more than half of the resources are saved in comparison with [22] and [21]. Roy et al. [19] recently presented a single-instruction-based light ECC processor coupled with dedicated hardcores of the FPGAs for NIST P-256. That implementation occupies only 72 Slices, but consumes considerably more DSP blocks and BRAMs (8 DSP blocks and 24 BRAMs). Therefore, our implementation achieves a good balance between consumed logic and hardcores. Furthermore, thanks to the optimization inside the DSP blocks, our ECC implementation can run at a high frequency and compute the ECPM for general prime fields with a low latency.

6 Conclusion

In this paper, we propose a very low-cost ECC implementation for general prime fields. In the architecture design, we efficiently integrate all the prime-field operations into the compact ALU, and specifically optimize its functions for elliptic curve arithmetic. In the implementation, we further adopt platform-targeted optimization techniques, such as pipelining inside DSP blocks and SRL-based shift registers in FPGAs, which allow us to further reduce the area of the implementation and improve the efficiency. Experimental results indicate that the implementation occupies only 105 Slices, 2 DSP blocks and 2 BRAMs in a Spartan-6 FPGA. In future work, we will add countermeasures to the implementation to resist side-channel attacks.