Montgomery Multiplication Using Vector Instructions
Abstract
In this paper we present a parallel approach to compute interleaved Montgomery multiplication. This approach is particularly suitable to be computed on 2-way single instruction, multiple data platforms as can be found on most modern computer architectures in the form of vector instruction set extensions. We have implemented this approach for tablet devices which run the x86 architecture (Intel Atom Z2760) using SSE2 instructions as well as devices which run on the ARM platform (Qualcomm MSM8960, NVIDIA Tegra 3 and 4) using NEON instructions. When instantiating modular exponentiation with this parallel version of Montgomery multiplication we observed a performance increase of more than a factor of 1.5 compared to the sequential implementation in OpenSSL for the classical arithmetic logic unit on the Atom platform for 2048-bit moduli.
Keywords
Dispatch, Neon, Padding
1 Introduction
Modular multiplication of large integers is a computational building block used to implement public-key cryptography. For schemes like RSA [34], ElGamal [11] or DSA [36], the most common modulus size for parameters in use is 1024 bits [20, 28]. The typical modulus size will increase to 2048 and 3072 bits over the coming years, in order to comply with the current 112- and 128-bit security standard (cf. [31]). When computing multiple modular multiplications, Montgomery multiplication [30] provides a speedup to this core arithmetic operation. As RSA-based schemes are arguably the most frequently computed asymmetric primitives today, improvements to Montgomery multiplication are of immediate practical importance.
Many modern computer architectures provide vector instruction set extensions in order to perform single instruction, multiple data (SIMD) operations. Example platforms include the popular x86 architecture as well as the ARM platform that can be found in almost all modern smartphones and tablets. The research community has studied ways to reduce the latency of Montgomery multiplication by parallelizing this computation. These approaches vary from using the SIMD paradigm [8, 10, 18, 23] to the single instruction, multiple threads paradigm using a residue number system [14, 29] as described in [4, 19] (see Sect. 2.3 for a more detailed overview).
In this paper we present an approach to split the Montgomery multiplication into two parts which can be computed in parallel. We flip the sign of the precomputed Montgomery constant and accumulate the result in two separate intermediate values that are computed concurrently. This avoids using a redundant representation, as suggested for example in the recent SIMD approach for Intel architectures [18], since the intermediate values do not overflow to an additional word. Moreover, our approach is suitable for implementation using vector instruction set extensions which support 2-way SIMD operations, i.e., a single instruction that is applied to two data segments simultaneously. We implemented the sequential Montgomery multiplication algorithm using schoolbook multiplication on the classical arithmetic logic unit (ALU) and the parallel approach on the 2-way SIMD vector instruction set of both the x86 (SSE2) and the ARM (NEON) processors. Our experimental results show that on both 32-bit x86 and ARM platforms, widely available in a broad range of mobile devices, this parallel approach manages to outperform our classical sequential implementation.
Note that the approach and implementation used in the GNU multiple precision arithmetic library (GMP) [13] is faster than the one presented in this paper and the one used in OpenSSL [32] on some Intel platforms we tested. The GMP approach does not use interleaved Montgomery multiplication but first computes the multiplication, using an asymptotically fast method like Karatsuba [25], followed by the Montgomery reduction. GMP also uses dedicated squaring code, which is not used in our implementation. Note, however, that GMP is not a cryptographic library and does not strive to provide constant-time implementations. See Sect. 3.1 for a more detailed discussion of the different approaches.
2 Preliminaries
In this section we recall some of the facts related to SIMD instructions and Montgomery multiplication. In Sect. 2.3 we summarize related work of parallel software implementations of Montgomery multiplication.
2.1 SIMD Instruction Set Extensions
Many processors include instruction set extensions. In this work we mainly focus on extensions which support vector instructions following the single instruction, multiple data (SIMD) paradigm. The two platforms we consider are the x86 and the ARM, and the instruction set extensions for these platforms are outlined below. The main vector instructions used in this work (on both processor types) are integer multiply, shift, bitwise AND, addition, and subtraction.
The x86 SIMD Instruction Set Extensions. SIMD operations on x86 and x64 processors have been supported in a number of instruction set extensions, beginning with MMX in 1997. This work uses the streaming SIMD extensions 2 (SSE2) instructions, introduced in 2001. SSE2 has been included on most Intel and AMD processors manufactured since then. We use “SSE” to refer to SSE2. SSE provides 128-bit SIMD registers (eight registers on x86 and sixteen registers on x64) which may be viewed as vectors of 1-, 8-, 16-, 32-, or 64-bit integer elements operating using 128-, 16-, 8-, 4-, or 2-way SIMD respectively. Vector operations allow multiple arithmetic operations to be performed simultaneously, for example PMULUDQ multiplies the low 32 bits of a pair of 64-bit integers and outputs a pair of 64-bit integers. For a description of SSE instructions, see [22].
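The 2-way multiply described above can be emulated on plain integers to make the lane layout concrete. The following is only an illustrative sketch: the function name `pmuludq` and the representation of a 128-bit register as a Python integer are assumptions made for this example, not part of any real API.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def pmuludq(x: int, y: int) -> int:
    """Emulate a 2-way 32x32 -> 64-bit unsigned multiply on 128-bit values.

    Each 128-bit input is viewed as two 64-bit lanes; only the low 32 bits
    of each lane take part in the multiplication.
    """
    lo = (x & MASK32) * (y & MASK32)                  # lane 0: bits 0-31
    hi = ((x >> 64) & MASK32) * ((y >> 64) & MASK32)  # lane 1: bits 64-95
    return (hi << 64) | lo                            # two 64-bit products

# Two independent products computed by one "instruction".
x = (7 << 64) | 3
y = (5 << 64) | 2
r = pmuludq(x, y)
```

Because each 32-bit by 32-bit product is below \(2^{64}\), the two results never spill into the neighbouring lane.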
2.2 Montgomery Arithmetic
Montgomery arithmetic [30] consists of transforming operands into a Montgomery representation, performing the desired computations on these transformed numbers, then converting the result (also in Montgomery representation) back to the regular representation. Due to the overhead of changing representations, Montgomery arithmetic is best when used to replace a sequence of modular multiplications, since the overhead is amortized.
The idea behind Montgomery multiplication is to replace the expensive division operations required when computing the modular reduction by cheap shift operations (division by powers of two). Let \(w\) denote the word size in bits. We write integers in a radix \(r\) system, for \(r=2^w\) where typical values of \(w\) are \(w=32\) or \(w=64\). Let \(M\) be an \(n\)-word odd modulus such that \(r^{n-1}\le M<r^n\). The Montgomery radix \(r^n\) is a constant such that \(\gcd (r^n, M) = 1\). The Montgomery residue of an integer \(A\in \mathbf{{Z}}/M\mathbf{{Z}}\) is defined as \(\widetilde{A}=A \cdot r^n {~\mathrm {mod}~}M\). The Montgomery product of two residues is defined as \(M(\widetilde{A}, \widetilde{B}) = \widetilde{A} \cdot \widetilde{B} \cdot r^{-n} {~\mathrm {mod}~}M\). Algorithm 1 outlines interleaved Montgomery multiplication, denoted as coarsely integrated operand scanning in [26], where the multiplication and reduction are interleaved. Note that residues may be added and subtracted using regular modular algorithms since \(\widetilde{A}\pm \widetilde{B} \equiv (A\cdot r^n) \pm (B\cdot r^n) \equiv (A\pm B)\cdot r^n \pmod M\).
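As a rough illustration of the interleaved method, the following sketch follows the textbook form of word-by-word Montgomery multiplication with \(w=32\) and the constant \(\mu = -M^{-1} \bmod 2^{32}\). It is assumed, not confirmed, to match Algorithm 1 step for step; the function name `mont_mul` is hypothetical, and Python big integers stand in for limb arrays.

```python
W = 32
R = 1 << W  # radix r = 2^32

def mont_mul(a: int, b: int, m: int, n: int) -> int:
    """Return a*b*r^(-n) mod m for an odd n-word modulus m and 0 <= a, b < m."""
    mu = (-pow(m, -1, R)) % R          # mu = -m^(-1) mod r (Python 3.8+)
    c = 0
    for i in range(n):
        a_i = (a >> (W * i)) & (R - 1) # i-th word of a
        c += a_i * b                   # multiplication step
        q = (mu * (c & (R - 1))) % R   # makes c + q*m divisible by r
        c = (c + q * m) >> W           # reduction step: exact division by r
    if c >= m:                         # final conditional subtraction
        c -= m
    return c

# Usage: convert to Montgomery form, multiply, convert back.
m = (1 << 127) - 1                     # odd 4-word modulus
n = 4
a, b = 123456789123456789, 987654321987654321
a_t = (a * R**n) % m                   # Montgomery residue of a
b_t = (b * R**n) % m
prod_t = mont_mul(a_t, b_t, m, n)      # residue of a*b
prod = mont_mul(prod_t, 1, m, n)       # multiplying by 1 leaves Montgomery form
```

The conversions into and out of Montgomery form cost one multiplication each, which is why the representation pays off mainly for long sequences of multiplications.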
2.3 Related Work
There has been a considerable amount of work related to SIMD implementations of cryptography. The authors of [6, 12, 35] propose ways to speed up cryptography using the NEON vector instructions. Intel’s SSE2 vector instruction set extension is used to compute pairings in [15] and multiply big numbers in [21]. Simultaneously, people have studied techniques to create hardware and software implementations of Montgomery multiplication. We now summarize some of the techniques to implement Montgomery multiplication concurrently in a software implementation. A parallel software approach describing systolic (a specific arrangement of processing units used in parallel computations) Montgomery multiplication is described in [10, 23]. An approach using the vector instructions on the Cell microprocessor is considered in [8]. Exploiting much larger parallelism using the single instruction, multiple threads paradigm is realized by using a residue number system [14, 29] as described in [4]. This approach is implemented for the massively parallel graphics processing units in [19]. An approach based on Montgomery multiplication which allows one to split the operand into two parts, which can be processed in parallel, is called bipartite modular multiplication and is introduced in [24]. More recently, the authors of [18] describe an approach using the soon to be released AVX2 SIMD instructions, for Intel’s Haswell architecture, which uses 256-bit wide vector instructions. The main difference between the method proposed in this work and most of the SIMD approaches referred to here is that we do not follow the approach described in [21]: we do not use a redundant representation to accumulate multiple multiplications. Instead, we use a different approach to make sure no extra words are required for the intermediate values (see Sect. 3).
3 Montgomery Multiplication Using SIMD Extensions
Montgomery multiplication, as outlined in Algorithm 1, does not lend itself to parallelization directly. In this section we describe an algorithm capable of computing the Montgomery multiplication using two threads running in parallel which perform identical arithmetic steps. Hence, this algorithm can be implemented efficiently using common 2-way SIMD vector instructions. For illustrative purposes we assume a radix-\(2^{32}\) system, but this can be adjusted accordingly to other choices of radix.
The second idea is to flip the sign of the Montgomery constant \(\mu \): i.e., instead of using \(-M^{-1} {~\mathrm {mod}~}{2^{32}}\) (as in Algorithm 1) we use \(\mu =M^{-1} {~\mathrm {mod}~}{2^{32}}\) (the reason for this choice is outlined below). When computing the Montgomery product \(C = A\cdot B \cdot 2^{-32n} {~\mathrm {mod}~}M\), for an odd modulus \(M\) such that \(2^{32(n-1)}\le M < 2^{32n}\), one can compute \(D\), which contains the sum of the products \(a_iB\), and \(E\), which contains the sum of the products \(qM\), separately. Due to our choice of the Montgomery constant \(\mu \) we have \(C=D-E\equiv A\cdot B \cdot 2^{-32n} \pmod M\), where \(0\le D, E < M\): the maximum values of both \(D\) and \(E\) fit in an \(n\)-limb integer, avoiding a carry that might result in an \((n+1)\)-limb long integer as in Algorithm 1. This approach is outlined in Algorithm 2.
Except for the computation of \(q\), all arithmetic computations performed by Computation 1 and Computation 2 are identical but work on different data. This makes Algorithm 2 suitable for implementation using 2-way 32-bit SIMD vector instructions. This approach benefits from 2-way SIMD \(32\times 32\rightarrow 64\)-bit multiplication and matches exactly the 128-bit wide vector instructions as present in SSE and NEON. Changing the radix used in Algorithm 2 allows implementation with larger or smaller vector instructions. For example, if a \(64\times 64 \rightarrow 128\)-bit vector multiply instruction is provided in a future version of AVX, implementing Algorithm 2 in a radix-\(2^{64}\) system with 256-bit wide vector instructions could potentially speed up modular multiplication by a factor of up to two on 64-bit systems (see Sect. 3.1).
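The two-accumulator idea can be sketched on big integers as follows. This is a simplified model under stated assumptions rather than a transcription of Algorithm 2: it uses radix \(2^{32}\), the positive constant \(\mu = M^{-1} \bmod 2^{32}\), and chooses \(q\) so that the two running sums agree in their lowest word. The function name `mont_mul_2way` is hypothetical, and the two updates that a SIMD unit would perform in separate lanes are simply written one after the other.

```python
W = 32
R = 1 << W  # radix 2^32

def mont_mul_2way(a: int, b: int, m: int, n: int) -> int:
    """Return a*b*2^(-32n) mod m using two accumulators d and e (0 <= a, b < m)."""
    mu = pow(m, -1, R)                   # flipped sign: mu = +m^(-1) mod 2^32
    d = e = 0
    for j in range(n):
        a_j = (a >> (W * j)) & (R - 1)   # j-th word of a
        # Choose q so that d + a_j*b and e + q*m agree modulo 2^32;
        # the low words then cancel in the difference d - e.
        q = (mu * ((d + a_j * b - e) & (R - 1))) & (R - 1)
        d = (d + a_j * b) >> W           # "Computation 1": products a_j * B
        e = (e + q * m) >> W             # "Computation 2": products q * M
        assert 0 <= d < m and 0 <= e < m # both stay within n words
    c = d - e
    if c < 0:                            # final conditional addition
        c += m
    return c

m = (1 << 127) - 1                       # odd 4-word modulus
n = 4
a, b = 0x0123456789ABCDEF0123456789ABCDEF, 0x0FEDCBA9876543210FEDCBA987654321
c = mont_mul_2way(a, b, m, n)
```

In this model the only data shared between the two updates is \(q\), which is the exception to the otherwise identical work in the two computations noted above.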
Table 1. A simplified comparison, stating only the number of arithmetic operations required, of the expected performance of Montgomery multiplication when using a \(32n\)-bit modulus for a positive even integer \(n\). The left side of the table shows arithmetic instruction counts for the sequential algorithm using the classical ALU (Algorithm 1) and when using 2-way SIMD instructions with the parallel algorithm (Algorithm 2). The right side of the table shows arithmetic instruction counts when using one level of Karatsuba’s method [25] for the multiplication as analyzed in [17]. A dash marks an instruction that is not counted for that implementation.
Instruction  Classical 32-bit  Classical 64-bit  2-way SIMD 32-bit  Karatsuba 32-bit  Instruction (Karatsuba)
add  -  -  \(n\)  \(\frac{13}{4}n^2+8n+2\)  add
sub  -  -  \(n\)  \(\frac{7}{4}n^2+n\)  mul
shortmul  \(n\)  \(\frac{n}{2}\)  \(2n\)
muladd  \(2n\)  \(n\)  -
muladdadd  \(2n(n-1)\)  \(n(\frac{n}{2}-1)\)  -
SIMD muladd  -  -  \(n\)
SIMD muladdadd  -  -  \(n(n-1)\)
3.1 Expected Performance
The question remains if Algorithm 2, implemented for a 2-way SIMD unit, outperforms Algorithm 1, implemented for the classical ALU. This mainly depends on the size of the inputs and outputs of the integer instructions, how many instructions can be dispatched per cycle, and the number of cycles an instruction needs to complete. In order to give a (simplified) prediction of the performance we compute the expected performance of a Montgomery multiplication using a \(32n\)-bit modulus for a positive even integer \(n\). Let \(\mathtt{muladd}_w(e,a,b,c)\) and \(\mathtt{muladdadd}_w(e,a,b,c,d)\) denote the computation of \(e = a\times b + c\) and \(e = a\times b + c + d\), respectively, for \(0\le a, b, c, d < 2^w\) and \(0\le e < 2^{2w}\) as a basic operation on a compute architecture which works on \(w\)-bit words. Some platforms have these operations as a single instruction (e.g., on some ARM architectures) or they must be implemented using a multiplication and addition(s) (as on the x86 platform). Furthermore, let \(\mathtt{shortmul}_w(e,a,b)\) denote \(e = a\times b {~\mathrm {mod}~}{2^w}\): this only computes the lower word of the result and can be done faster (compared to a full product) on most platforms.
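The three basic operations can be written down directly, which also makes the bound \(e < 2^{2w}\) easy to check. The Python function names below simply mirror the notation in the text and are otherwise assumptions of this sketch.

```python
def muladd(a: int, b: int, c: int, w: int = 32) -> int:
    """e = a*b + c for w-bit inputs; the result fits in 2w bits."""
    e = a * b + c
    assert e < 1 << (2 * w)
    return e

def muladdadd(a: int, b: int, c: int, d: int, w: int = 32) -> int:
    """e = a*b + c + d for w-bit inputs; the result still fits in 2w bits."""
    e = a * b + c + d
    assert e < 1 << (2 * w)
    return e

def shortmul(a: int, b: int, w: int = 32) -> int:
    """e = a*b mod 2^w: only the low word of the product."""
    return (a * b) & ((1 << w) - 1)

# Worst case: every input equals 2^w - 1.
top = (1 << 32) - 1
worst = muladdadd(top, top, top, top)
```

The worst case is \((2^w-1)^2 + 2(2^w-1) = 2^{2w}-1\), so a double-word result can absorb two extra single-word additions without overflow.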
Table 1 summarizes the expected performance of Algorithms 1 and 2 in terms of arithmetic operations only (e.g., the data movement, shifting and masking operations are omitted). Also the operations required to compute the final conditional subtraction or addition have been omitted. When solely considering the muladd and muladdadd instructions it becomes clear from Table 1 that the SIMD approach uses exactly half of the number of operations compared to the 32-bit classical implementation and almost twice as many operations compared to the classical 64-bit implementations. However, the SIMD approach requires more operations to compute the value of \(q\) every iteration and has various other overhead (e.g., inserting and extracting values from the vector). Hence, when assuming that all the characteristics of the SIMD and classical (non-SIMD) instructions are identical, which will not be the case on all platforms, then we expect Algorithm 2 running on a 2-way 32-bit SIMD unit to outperform a classical 32-bit implementation using Algorithm 1 by at most a factor of two while being roughly twice as slow when compared to a classical 64-bit implementation.
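The counts in Table 1 can be evaluated for a concrete size to see the comparison numerically. The sketch below encodes the table's formulas as Python functions (the function names are hypothetical) and evaluates them for \(n=32\), i.e., a 1024-bit modulus with 32-bit limbs.

```python
def classical32(n: int) -> dict:
    """Table 1 counts for Algorithm 1 on a 32-bit ALU (n even)."""
    return {"shortmul": n, "muladd": 2 * n, "muladdadd": 2 * n * (n - 1)}

def classical64(n: int) -> dict:
    """Table 1 counts for Algorithm 1 on a 64-bit ALU (n counts 32-bit limbs)."""
    return {"shortmul": n // 2, "muladd": n, "muladdadd": n * (n // 2 - 1)}

def simd2way(n: int) -> dict:
    """Table 1 counts for Algorithm 2 on a 2-way 32-bit SIMD unit."""
    return {"add": n, "sub": n, "shortmul": 2 * n,
            "muladd": n, "muladdadd": n * (n - 1)}

def karatsuba_mul(n: int) -> int:
    """Multiplication count for one level of Karatsuba as listed in Table 1."""
    return 7 * n * n // 4 + n

def mul_accumulate(counts: dict) -> int:
    """Number of muladd plus muladdadd operations."""
    return counts["muladd"] + counts["muladdadd"]

n = 32
c32 = mul_accumulate(classical32(n))
c64 = mul_accumulate(classical64(n))
simd = mul_accumulate(simd2way(n))
mul_ratio = sum(classical32(n).values()) / karatsuba_mul(n)
print(c32, simd, c64, round(mul_ratio, 2))
```

These are operation counts only; as the text stresses, latency, dispatch width, and vector insert/extract overhead decide the real outcome on a given processor.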
Inherently, the interleaved Montgomery multiplication algorithm (as used in this work) is not compatible with asymptotically faster integer multiplication algorithms like Karatsuba multiplication [25]. We have not implemented the Montgomery multiplication by first computing the multiplication using such faster methods, and then computing the modular reduction, using SIMD vector instructions in one or both steps. In [17], instruction counts are presented when using the interleaved Montgomery multiplication, as used in our baseline implementation, as well as for an approach where the multiplication and reduction are computed separately. Separating these two steps makes it easier to use a squaring algorithm. In [17] a single level of Karatsuba on top of Comba’s method [9] is considered: the arithmetic instruction counts are stated in Table 1. For 1024-bit modular multiplication (used for 2048-bit RSA decryption using the CRT), the Karatsuba approach can reduce the number of multiplication and addition instructions by factors of 1.14 and 1.18, respectively, on 32-bit platforms compared to the sequential interleaved approach. When comparing the arithmetic instructions only, the SIMD approach for interleaved Montgomery multiplication is 1.70 and 1.67 times faster than the sequential Karatsuba approach for 1024-bit modular multiplication on 32-bit platforms. Obviously, the Karatsuba approach can be sped up using SIMD instructions as well.
The results in Table 1 are for Montgomery multiplication only. It is known how to optimize (sequential) Montgomery squaring [16], but as far as we are aware, not how to optimize squaring using SIMD instructions. Following the analysis from [17], the cost of a Montgomery squaring is \(\frac{11n+14}{14n+8}\) and \(\frac{3n+5}{4n+2}\) times the cost of a Montgomery multiplication when using the Karatsuba or interleaved Montgomery approach, respectively, on \(n\)-limb integers. For 1024-bit modular arithmetic (as used in RSA-2048 with \(n=32\)) this results in \(0.80\) (for Karatsuba) and \(0.78\) (for interleaved). For RSA-2048, approximately \(5/6\) of all operations are squarings: this highlights the potential of an efficient squaring implementation.
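The two squaring ratios can be evaluated directly for \(n=32\). The last two lines of the sketch go one step further and combine the interleaved ratio with the \(5/6\) share of squarings; that weighted average is a back-of-the-envelope assumption of this example, not a figure from the analysis in [17].

```python
from fractions import Fraction

def squaring_ratio_karatsuba(n: int) -> Fraction:
    """Cost of a squaring relative to a multiplication, Karatsuba approach."""
    return Fraction(11 * n + 14, 14 * n + 8)

def squaring_ratio_interleaved(n: int) -> Fraction:
    """Cost of a squaring relative to a multiplication, interleaved approach."""
    return Fraction(3 * n + 5, 4 * n + 2)

n = 32  # 1024-bit operands with 32-bit limbs
kara = squaring_ratio_karatsuba(n)
inter = squaring_ratio_interleaved(n)
print(round(float(kara), 2), round(float(inter), 2))

# Assumption: 5/6 of the operations are squarings, 1/6 are multiplications.
# Relative cost of an exponentiation with a dedicated squaring routine:
relative_cost = Fraction(1, 6) + Fraction(5, 6) * inter
print(float(relative_cost))
```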
4 Implementation Results
We have implemented interleaved Montgomery modular multiplication (Algorithm 1) as a baseline for comparison with the SIMD version (Algorithm 2). In both implementations, the final addition/subtraction was implemented using masking such that it runs in constant time, to resist certain types of side-channel attacks using timing and branch prediction. Since the cost of this operation was observed to be a small fraction of the overall cost, we chose not to write a separate optimized implementation for operations using only public values (such as signature verification).
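The masking idea for the final correction can be illustrated on 32-bit limbs as follows. This sketch shows only the data flow (subtract, turn the borrow into an all-ones or all-zero mask, add the masked modulus) and is not the implementation benchmarked here. Python itself gives no constant-time guarantees, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_limbs(x: int, n: int) -> list:
    """Split x into n 32-bit limbs, least significant first."""
    return [(x >> (32 * i)) & MASK32 for i in range(n)]

def from_limbs(limbs: list) -> int:
    """Inverse of to_limbs."""
    return sum(v << (32 * i) for i, v in enumerate(limbs))

def sub_then_masked_add(d: list, e: list, m: list) -> list:
    """Return (d - e) mod m for d, e < m without branching on the sign."""
    n = len(m)
    c = [0] * n
    borrow = 0
    for i in range(n):                 # c = d - e, tracking the borrow
        t = d[i] - e[i] - borrow
        c[i] = t & MASK32
        borrow = (t >> 32) & 1         # 1 if this limb wrapped around
    mask = (-borrow) & MASK32          # all ones if d < e, otherwise zero
    carry = 0
    for i in range(n):                 # always add (m AND mask)
        t = c[i] + (m[i] & mask) + carry
        c[i] = t & MASK32
        carry = t >> 32                # the final carry cancels the wrap-around
    return c

m_int = (1 << 64) - 59
d_int, e_int = 5, 9
out = from_limbs(sub_then_masked_add(to_limbs(d_int, 2), to_limbs(e_int, 2), to_limbs(m_int, 2)))
```

The same sequence of limb operations is executed whether or not a correction is needed, which is the property the masking approach relies on.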
Intel Xeon E3-1230. A quad core 3.2 GHz CPU on an HP Z210 workstation. We used SSE2 for Algorithm 2 and also benchmark x86-32 and x86-64 implementations of Algorithm 1 for comparison.
Intel Atom Z2760. A dual core 1.8 GHz system-on-a-chip (SoC), on an Asus Vivo Tab Smart Windows 8 tablet.
NVIDIA Tegra T30. A quad core 1.4 GHz ARM Cortex-A9 SoC, on an NVIDIA developer tablet.
Qualcomm MSM8960. A quad core 1.8 GHz Snapdragon S4 SoC, on a Dell XPS 10 tablet.
NVIDIA Tegra 4. A quad core 1.91 GHz ARM Cortex-A15 SoC, on an NVIDIA developer tablet.
On the Xeon system, Intel’s Turbo Boost feature will dynamically increase the frequency of the processor under high computational load. We found Turbo Boost had a modest impact on our timings. Since it is a potential source of variability, all times reported here were measured with Turbo Boost disabled.
Benchmarks. We chose to benchmark the cost of modular multiplication for 512-bit, 1024-bit and 2048-bit moduli, since these are currently used in deployed cryptography. The 512-bit modular multiplication results may also be interesting for usage in elliptic curve and pairing based cryptosystems. We created implementations optimized for these “special” bit-lengths as well as generic implementations, i.e., implementations that operate with arbitrary length inputs. For comparison, we include the time for modular multiplication with 1024- and 2048-bit generic implementations. Our x64 baseline implementation has no length-specific code (we did not observe performance improvements).
We also benchmark the cost of RSA encryption and decryption using the different modular multiplication routines. We do not describe our RSA implementation in detail, because it is the same for all benchmarks, but note that: (i) decryption with an \(n\)-bit modulus is done with \(n/2\)-bit arithmetic using the Chinese remainder theorem approach, (ii) this is a “raw” RSA operation, taking an integer as plaintext input, no padding is performed, (iii) no specialized squaring routine is used, and (iv) the public exponent in our benchmarks is always \(2^{16}+1\). We compute the modular exponentiation using a windowing based approach. As mentioned in (iii), we have not considered a specialized Montgomery squaring algorithm for the sequential or the SIMD algorithms. Using squaring routines can significantly enhance the performance of our code as discussed in Sect. 3.1.
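A windowing based exponentiation built on top of an arbitrary modular multiplication routine might look as follows. The window width of 4, the function names, and the plain `a*b % m` stand-in for the multiplication are all assumptions of this sketch; the table lookup is also not hardened against side channels, and with Montgomery arithmetic the constant 1 would be replaced by its Montgomery residue.

```python
def plain_modmul(a: int, b: int, m: int) -> int:
    """Stand-in for a Montgomery or SIMD modular multiplication routine."""
    return (a * b) % m

def window_exp(base: int, exp: int, mod: int, modmul=plain_modmul, k: int = 4) -> int:
    """Fixed-window exponentiation: base^exp mod `mod` with window width k."""
    # Precompute base^0 .. base^(2^k - 1).
    table = [1 % mod]
    for _ in range((1 << k) - 1):
        table.append(modmul(table[-1], base, mod))
    result = 1 % mod
    windows = (exp.bit_length() + k - 1) // k
    for i in reversed(range(windows)):       # most significant window first
        for _ in range(k):                   # k squarings per window
            result = modmul(result, result, mod)
        digit = (exp >> (k * i)) & ((1 << k) - 1)
        result = modmul(result, table[digit], mod)
    return result

e = 2**16 + 1                                # public exponent used in the benchmarks
modulus = (1 << 127) - 1
c = window_exp(42, e, modulus)
```

Note that no dedicated squaring routine is called: squarings go through the same `modmul`, mirroring point (iii) above.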
Table 2. Implementation timings in microseconds and cycles for x86/x64 based processors. The “ratio” column is baseline/SIMD. The 512g, 1024g and 2048g rows are generic implementations that do not optimize for a specific bit-length.
Benchmark  Xeon x86 Baseline  Xeon x86 SIMD  Xeon x86 Ratio  Xeon x64 Baseline  Xeon x64 SIMD  Xeon x64 Ratio  Atom (x86) Baseline  Atom (x86) SIMD  Atom (x86) Ratio
modmul 512  1.229  0.805  1.53  0.498  0.805  0.62  5.948  4.317  1.38 
(cycles)  3933  2577  1.53  1598  2577  0.62  10706  7775  1.38 
modmul 1024  3.523  1.842  1.91  1.030  1.842  0.56  21.390  12.388  1.73 
(cycles)  11255  5887  1.91  3295  5887  0.56  38479  22288  1.73 
RSA enc 1024  75.459  36.745  2.05  16.411  36.745  0.45  407.835  250.285  1.63 
(cycles)  241014  117419  2.05  52457  117419  0.45  733224  450092  1.63 
RSA dec 1024  1275.030  656.831  1.94  278.444  656.831  0.42  6770.646  4257.838  1.59 
(cycles)  4070962  2097258  1.94  889103  2097258  0.42  12167933  7652178  1.59 
modmul 2048  13.873  5.488  2.53  3.012  5.488  0.55  72.870  41.402  1.76 
(cycles)  44302  17529  2.53  9621  17529  0.55  130975  74425  1.76 
RSA enc 2048  277.719  129.876  2.14  56.813  129.876  0.44  1437.459  891.185  1.61 
(cycles)  886828  414787  2.14  181412  414787  0.44  2583643  1601878  1.61 
RSA dec 2048  8231.233  3824.690  2.15  1543.666  3824.690  0.40  44629.140  28935.088  1.54 
(cycles)  26280725  12211700  2.15  4928633  12211700  0.40  80204317  52000367  1.54 
modmul 512g  1.356  0.986  1.38  0.498  0.986  0.51  6.387  5.116  1.25 
(cycles)  4336  3155  1.37  1598  3155  0.51  11496  9213  1.25 
modmul 1024g  4.111  2.534  1.62  1.030  2.534  0.41  25.362  13.560  1.87 
(cycles)  13132  8098  1.62  3295  8098  0.41  45631  24393  1.87 
modmul 2048g  15.607  9.304  1.68  3.012  9.304  0.32  74.212  44.806  1.66 
(cycles)  49838  29714  1.68  9621  29714  0.32  133387  80543  1.66 
Table 3. Implementation timings in microseconds and cycles for ARM-based processors. The “ratio” column is baseline/SIMD. The 512g, 1024g and 2048g rows are generic implementations that do not optimize for a specific bit-length.
Benchmark  Snapdragon S4 Baseline  Snapdragon S4 SIMD  Snapdragon S4 Ratio  Tegra 4 Baseline  Tegra 4 SIMD  Tegra 4 Ratio  Tegra 3 Baseline  Tegra 3 SIMD  Tegra 3 Ratio
modmul 512  4.097  3.384  1.21  1.976  2.212  0.89  3.553  5.265  0.67 
(cycles)  6443  5372  1.20  3658  4020  0.91  4678  6861  0.68 
modmul 1024  10.676  7.281  1.47  8.454  8.622  0.98  9.512  15.891  0.60 
(cycles)  16382  11243  1.46  10351  10560  0.98  12314  20490  0.60 
RSA enc 1024  198.187  142.956  1.38  168.617  179.227  0.94  189.420  295.110  0.64 
(cycles)  302898  219244  1.38  195212  207647  0.94  245167  379736  0.65 
RSA dec 1024  3424.413  2475.716  1.38  1999.211  2303.588  0.87  3306.230  5597.280  0.59 
(cycles)  5179365  3746371  1.38  3288177  3332262  0.99  4233862  7166897  0.59 
modmul 2048  36.260  21.531  1.68  30.465  32.064  0.95  31.912  55.070  0.58 
(cycles)  55260  32978  1.68  37185  36984  1.01  41004  70655  0.58 
RSA enc 2048  716.160  467.713  1.53  593.326  617.758  0.96  679.920  1060.050  0.64 
(cycles)  1087318  710910  1.53  725336  712542  1.02  872468  1358955  0.64 
RSA dec 2048  22992.576  14202.886  1.62  19024.405  19797.988  0.96  21519.880  36871.550  0.58 
(cycles)  34769147  21478047  1.62  23177617  22812040  1.02  27547434  47205919  0.58 
modmul 512g  4.586  4.149  1.11  2.187  2.798  0.78  4.108  6.177  0.67 
(cycles)  7179  6627  1.08  4045  5166  0.78  5383  8029  0.67 
modmul 1024g  12.274  9.697  1.27  8.973  12.151  0.74  12.112  19.421  0.62 
(cycles)  18795  14894  1.26  10984  14870  0.74  15652  25004  0.63 
modmul 2048g  40.554  30.743  1.32  31.959  44.841  0.71  40.494  69.009  0.59 
(cycles)  61621  46945  1.31  37786  51693  0.73  51993  88500  0.59 
ARM Results. On ARM our results are more mixed (see Table 3). First we note that on the Tegra 3 SoC, our NEON implementation of Algorithm 2 is consistently worse than the baseline, almost twice as slow. Going back to our analysis in Sect. 3.1, this would occur if the cost of a vector multiply instruction (performing two 32-bit multiplies) was about the cost of two non-vector multiply instructions. This is (almost) the case according to the Cortex-A9 instruction latencies published by ARM. Our efforts to pipeline multiple vector multiply instructions did not sufficiently pay off – the length-specific implementations give a \(1.27\) factor speedup over the generic implementations, roughly the same speedup obtained when we optimize the baseline for a given bit-length (by fully unrolling the inner loop).
On the newer ARM SoCs in our experiments, the S4 and Tegra 4, the results are better. On the Snapdragon S4 the SIMD implementation is consistently better than the baseline. The NEON length-specific implementations were especially important and resulted in a speedup by a factor of \(1.30\) to \(1.40\) over generic implementations, while optimizing the baseline implementation for a specific length was only faster by a factor slightly above \(1.10\). This is likely due to the inability of the processor to effectively reorder NEON instructions to minimize pipeline stalls – the main difference in our length-specific implementation was to partially unroll the inner loop and reorder instructions to use more registers and pipeline four multiply operations.
Performance of the SIMD algorithm on the Tegra 4 was essentially the same as the baseline performance. This is a solid improvement in NEON performance compared to our benchmarks on the Tegra 3; however, the Tegra 4’s NEON performance still lags behind the S4 (for the instructions used in our benchmarks). We suspect (based on informal experiments) that an implementation of Algorithm 2 specifically optimized for the Tegra 4 could significantly outperform the baseline, but still would not be comparable to the S4.
Table 4. Performance results expressed in cycles of RSA 1024-bit and 2048-bit encryption (enc) and decryption (dec). The first four rows have been obtained from eBACS: ECRYPT Benchmarking of Cryptographic Systems [5] while the fifth row corresponds to running the performance benchmark suite of OpenSSL [32] on the same Atom device used to obtain the performance results in Table 2. The last two rows correspond to running GMP on our Atom and Xeon (in 32-bit mode).
Platform  RSA 1024 Enc  RSA 1024 Dec  RSA 2048 Enc  RSA 2048 Dec
ARM – Tegra 250 (1000 MHz)  261677  11684675  665195  65650103 
ARM – Snapdragon S3 (1782 MHz)  276836  7373869  609593  39746105 
x86 – Atom N280 (1667 MHz)  315620  13116020  871810  81628170 
x64 – Xeon E3-1225 (3100 MHz)  49652  1403884  103744  6158336 
x86 – Atom Z2760 (1800 MHz)  610200  10929600  2323800  75871800 
x86 – Atom Z2760 (1800 MHz)  305545  5775125  2184436  37070875 
x86 – Xeon E3-1230 (3200 MHz)  106035  1946434  695861  11929868 
4.1 Comparison to Previous Work
Comparison to eBACS and OpenSSL. We have compared our SIMD implementation of the interleaved Montgomery multiplication algorithm to our baseline implementation of this method. To show that our baseline is competitive and put our results in a wider context, we compare to benchmark results from eBACS: ECRYPT Benchmarking of Cryptographic Systems [5] and to OpenSSL [32]. Table 4 summarizes the cycle counts from eBACS on platforms which are close to the ones we consider in this work, and also includes the results of running the performance benchmark of OpenSSL 1.0.1e [32] on our Atom device. As can be seen from Table 4, our baseline implementation results from Tables 2 and 3 are similar (except for 1024-bit RSA decryption, which our implementation does not optimize, as mentioned above).
Comparison to GMP. The implementation in the GNU multiple precision arithmetic library (GMP) [13] is based on non-interleaved Montgomery multiplication. This means the multiplication is computed first, possibly using an asymptotically faster algorithm than schoolbook, followed by the Montgomery reduction (see Sect. 3.1). The last two rows in Table 4 summarize performance numbers for our Atom and Xeon (in 32-bit mode) platforms. The GMP performance numbers for RSA-2048 decryption on the Atom (37.1 million cycles) are significantly faster compared to OpenSSL (75.9 million), our baseline (80.2 million) and our SIMD (52.0 million) implementations. On the 32-bit Xeon the performance of the GMP implementation, which uses SIMD instructions for the multiplication and has support for asymptotically faster multiplication algorithms, is almost identical to our SIMD implementation which uses interleaved Montgomery multiplication. Note that both OpenSSL and our implementations are designed to resist side-channel attacks, and run in constant time, while both the GMP modular exponentiation and multiplication are not, making GMP unsuitable for use in many cryptographic applications. The multiplication and reduction routines in GMP can be adapted for cryptographic purposes but it is unclear at what performance price. From Table 2, it is clear that our SIMD implementation performs better on the 32-bit Xeon than on the Atom. The major difference between these two processors is the instruction scheduler (in-order on the Atom and out-of-order on the Xeon).
4.2 Engineering Challenges
In this section we discuss some engineering challenges we had to overcome in order to use SIMD in practice. Our goal is an implementation that is efficient and supports multiple processors, but is also maintainable. The discussion here may not be applicable in other circumstances.
ASM or Intrinsics? There are essentially two ways to access the SIMD instructions directly from a C program. One either writes assembly language (ASM), or uses compiler intrinsics. Intrinsics are macros that the compiler translates to specific instructions, e.g., on ARM, the Windows RT header file arm_neon.h defines the intrinsic vmull_u32, which the compiler implements with the vmull instruction. In addition to instructions, the header also exposes special data types corresponding to the 64- and 128-bit SIMD registers.
We chose to use intrinsics for our implementation, for the following reasons. C with intrinsics is easier to debug, e.g., it is easier to detect mistakes using assertions. Furthermore, while there is a performance advantage for ASM implementations, these gains are limited in comparison to a careful C implementation with intrinsics (in our experience). In addition ASM is difficult to maintain. For example, in ASM the programmer must handle all argument passing and set up the stack frame, and this depends on the calling conventions. If calling conventions are changed, the ASM will need to be rewritten, rather than simply recompiled. Also, when writing for the Microsoft Visual Studio Compiler, the compiler automatically generates the code to perform structured exception handling (SEH), which is an exception handling mechanism at the system level for Windows and a requirement for all code running on this operating system. Incorrect implementation of SEH code may result in security bugs that are often difficult to detect until they are used in an exploit. Also, compiler features such as Whole Program Optimization and Link Time Code Generation, that optimize code layout and time-memory usage tradeoffs, will not work correctly on ASM.
Despite the fact that one gets more control of the code (e.g., register usage) when writing in ASM, using intrinsics and C can still be efficient. Specifically, we reviewed the assembly code generated by the compiler to ensure that the runtime of this code remains constant and that register usage is as we expected. In short, we have found that ASM implementations require increased engineering time and effort, both in initial development and maintenance, for a relatively small gain in performance. We have judged that this trade-off is not worthwhile for our implementation.
simd.h Abstraction Layer. Both SSE2 and NEON vector instructions are accessible as intrinsics; however, the types and instructions available for each differ. To allow a single SIMD implementation to run on both architectures, we abstracted a useful subset of SSE2 and NEON in a header named simd.h. Based on the architecture, this header defines inline functions wrapping a processor-specific intrinsic. simd.h also refines the vector data types, e.g., the type simd32x2p_t stores two 32-bit unsigned integers in a 64-bit register on ARM, but on x86 stores them in a 128-bit integer (in bits 0–31 and 64–95), so that they are in the correct format for the vector multiply instruction (which returns a value of type simd64x2_t on both architectures). The compiler will check that the arguments to the simd.h functions match the prototype, something that is not possible with intrinsics (which are preprocessor macros). While abstraction layers are almost always technically possible, we find it noteworthy that in this case it can be done without adding significant overhead, and code using the abstraction performs well on multiple processors. With simd.h containing all of the architecture-specific code, the SIMD timings in the tables above were generated with two implementations: a generic one, and a length-specific one that requires the number of limbs in the modulus be divisible by four, to allow partial unrolling of the inner loop of Algorithm 2.
Length-Specific Routines. Given the results from Tables 2 and 3, it is clear that having specialized routines for certain bit lengths is worthwhile. In a math library used to implement multiple crypto primitives, each supporting a range of allowed key sizes, routines for arbitrary-length moduli are required as well. This raises the question of how to automatically select one of multiple implementations. We experimented with two different designs. The first design stores a function pointer to the modular multiplication routine along with the modulus. The second uses a function pointer to a length-specific exponentiation routine. On the x86 and x64 platforms, with 1024-bit (and larger) operands, the performance difference between the two approaches is small (the latter was faster by a factor of around \(1.10\)); on ARM, however, using function pointers to multiplication routines is slower by a factor of up to \(1.30\) than using pointers to exponentiation routines. The drawback of this latter approach is the need to maintain multiple exponentiation routines.
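The first design can be sketched as follows. This is only an illustration of the dispatch mechanism: the names mod_ctx, modmul_fn, modmul_generic, modmul_unrolled4 and mod_ctx_init are hypothetical, the routine bodies are placeholders that perform no arithmetic, and the divisible-by-four selection rule is borrowed from the length-specific SIMD implementation described earlier.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct mod_ctx mod_ctx;

/* Signature shared by all modular multiplication routines (hypothetical). */
typedef void (*modmul_fn)(uint32_t *r, const uint32_t *a,
                          const uint32_t *b, const mod_ctx *m);

struct mod_ctx {
    const uint32_t *n;      /* modulus limbs, least significant first */
    size_t          nlimbs; /* number of 32-bit limbs                 */
    modmul_fn       modmul; /* routine chosen once, at initialization */
};

/* Demonstration aid only: records which placeholder ran last
   (0 = none, 1 = generic, 2 = unrolled). */
static int last_routine = 0;

/* Placeholder for a routine that handles any number of limbs. */
static void modmul_generic(uint32_t *r, const uint32_t *a,
                           const uint32_t *b, const mod_ctx *m) {
    (void)r; (void)a; (void)b; (void)m;
    last_routine = 1;
}

/* Placeholder for a routine that assumes nlimbs is divisible by four. */
static void modmul_unrolled4(uint32_t *r, const uint32_t *a,
                             const uint32_t *b, const mod_ctx *m) {
    (void)r; (void)a; (void)b; (void)m;
    last_routine = 2;
}

/* Store the routine along with the modulus. */
static void mod_ctx_init(mod_ctx *m, const uint32_t *n, size_t nlimbs) {
    m->n = n;
    m->nlimbs = nlimbs;
    m->modmul = (nlimbs % 4 == 0) ? modmul_unrolled4 : modmul_generic;
}
```

A caller then writes m.modmul(r, a, b, &m) and pays one indirect call per multiplication. In the second design the pointer would instead refer to a whole exponentiation routine, so the indirect call happens once per exponentiation and the multiplications inside it are direct calls.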
SoC-Specific Routines. Our experiments with multiple ARM SoCs also show that performance can vary by SoC. This is expected; however, we were surprised by the range observed, compared to x86/x64 processors, which are more homogeneous. We also observed that a small code change can simultaneously result in a speed improvement on one SoC and a regression on another. Our current implementation performs a runtime check to identify the SoC, to decide whether to use Algorithm 1 or 2. Our results highlight that there is a great deal of variability between different implementations of the ARM architecture and that, for the time being, it is difficult to write code that performs well on multiple ARM SoCs simultaneously. This also implies that published implementation results for one ARM microprocessor core give little to no information on how the same code would perform on another. For more information, see the ARM technical reference manuals [3].
5 Conclusions and Future Work
In this paper we present a parallel version of the interleaved Montgomery multiplication algorithm that is amenable to implementation using widely available SIMD vector extension instructions (SSE2 and NEON). The practical implications of this approach are highlighted by our performance results on common tablet devices. When using 2048-bit moduli we are able to outperform our sequential implementation using the schoolbook multiplication method by a factor of 1.68 to 1.76 on both 32-bit x86 and ARM processors.
The performance numbers agree with our analysis that a 2-way SIMD implementation using 32-bit multipliers is not able to outperform a classical interleaved Montgomery multiplication implementation using 64-bit multiplication instructions. Hence, we also conclude that it would be beneficial for new 256-bit SIMD instruction sets to include 2-way integer multipliers. For example, our results suggest that modular multiplication could be sped up by up to a factor of two on x64 systems if a future set of AVX instructions included a \(64\times 64\rightarrow 128\)-bit 2-way SIMD multiplier.
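A back-of-the-envelope count makes this plausible. It assumes, as a simplification that is not taken from the tables above, that running time is dominated by word multiplications, that a sequential interleaved Montgomery multiplication on \(n\) limbs costs about \(2n^2\) of them (the leading term of the CIOS count in [26]), and that a 2-way SIMD version issues about \(n^2\) vector multiplications. Writing \(M_{\text{seq}}(w)\) and \(M_{\text{simd}}(w)\) (notation introduced only for this estimate) for these counts with \(w\)-bit limbs and a 2048-bit modulus, so that \(n = 2048/w\):

\[
\begin{aligned}
M_{\text{seq}}(32) &\approx 2\cdot 64^2 = 8192, & M_{\text{simd}}(32) &\approx 64^2 = 4096,\\
M_{\text{seq}}(64) &\approx 2\cdot 32^2 = 2048, & M_{\text{simd}}(64) &\approx 32^2 = 1024.
\end{aligned}
\]

Under these assumptions the 32-bit SIMD version still issues about twice as many multiplications as the 64-bit sequential one, while a hypothetical 64-bit 2-way multiplier would halve the 64-bit sequential count, in line with the factor of two mentioned above. Actual speedups also depend on instruction latency and throughput, which this count ignores.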
It remains of independent interest to study ways to use both asymptotically faster integer multiplication methods (like Karatsuba) and Montgomery reduction using SIMD instructions to reduce latency, including side-channel protections. This is left as future work. Furthermore, as pointed out by an anonymous reviewer, another possibility might be to compute the proposed parallel Montgomery multiplication routine using both the integer and floating-point unit instead of using vector instructions.
Acknowledgements
The authors would like to thank: Adam Glass for discussions on ARM SoCs; Patrick Longa for comments on baseline implementations and general help; Jason Mackay for catching mistakes in early drafts; Paul Schofield for help timing on the Tegra 4; and Niels Ferguson for discussions of SIMD. Also, we thank the anonymous reviewers of SAC for their helpful feedback and thank Daniel J. Bernstein and Tanja Lange for the additional suggestions, both of which improved the quality of this paper.
References
1. ARM. Cortex-A9. Technical Reference Manual (2010). Version r2p2
2. ARM. Cortex-A9 NEON Media Processing Engine. Technical Reference Manual (2012). Version r4p1
3. ARM Limited. ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition (2010)
4. Bajard, J.-C., Didier, L.-S., Kornerup, P.: An RNS Montgomery modular multiplication algorithm. IEEE Trans. Comput. 47(7), 766–776 (1998)
5. Bernstein, D.J., Lange, T. (eds.): eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://bench.cr.yp.to. Accessed 2 July 2013
6. Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012)
7. Bos, J.W.: High-performance modular multiplication on the cell processor. In: Hasan, M.A., Helleseth, T. (eds.) WAIFI 2010. LNCS, vol. 6087, pp. 7–24. Springer, Heidelberg (2010)
8. Bos, J.W., Kaihara, M.E.: Montgomery multiplication on the cell. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009, Part I. LNCS, vol. 6067, pp. 477–485. Springer, Heidelberg (2010)
9. Comba, P.G.: Exponentiation cryptosystems on the IBM PC. IBM Syst. J. 29(4), 526–538 (1990)
10. Dixon, B., Lenstra, A.K.: Massively parallel elliptic curve factoring. In: Rueppel, R.A. (ed.) EUROCRYPT 1992. LNCS, vol. 658, pp. 183–193. Springer, Heidelberg (1993)
11. ElGamal, T.: A public key cryptosystem and a signature scheme based on discrete logarithms. In: Blakley, G., Chaum, D. (eds.) CRYPTO 1984. LNCS, vol. 196, pp. 10–18. Springer, Heidelberg (1985)
12. Faz-Hernandez, A., Longa, P., Sanchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves. Cryptology ePrint Archive, Report 2013/158 (2013). http://eprint.iacr.org/. CT-RSA. doi:10.1007/9783319048529_1
13. Free Software Foundation, Inc. GMP: The GNU Multiple Precision Arithmetic Library (2013). http://www.gmplib.org/
14. Garner, H.L.: The residue number system. IRE Trans. Electron. Comput. 8, 140–147 (1959)
15. Grabher, P., Großschädl, J., Page, D.: On software parallel implementation of cryptographic pairings. In: Avanzi, R.M., Keliher, L., Sica, F. (eds.) SAC 2008. LNCS, vol. 5381, pp. 35–50. Springer, Heidelberg (2009)
16. Großschädl, J.: Architectural support for long integer modulo arithmetic on RISC-based smart cards. Int. J. High Perform. Comput. Appl. (IJHPCA) 17(2), 135–146 (2003)
17. Großschädl, J., Avanzi, R.M., Savaş, E., Tillich, S.: Energy-efficient software implementation of long integer modular arithmetic. In: Rao, J.R., Sunar, B. (eds.) CHES 2005. LNCS, vol. 3659, pp. 75–90. Springer, Heidelberg (2005)
18. Gueron, S., Krasnov, V.: Software implementation of modular exponentiation, using advanced vector instructions architectures. In: Özbudak, F., Rodríguez-Henríquez, F. (eds.) WAIFI 2012. LNCS, vol. 7369, pp. 119–135. Springer, Heidelberg (2012)
19. Harrison, O., Waldron, J.: Efficient acceleration of asymmetric cryptography on graphics hardware. In: Preneel, B. (ed.) AFRICACRYPT 2009. LNCS, vol. 5580, pp. 350–367. Springer, Heidelberg (2009)
20. Holz, R., Braun, L., Kammenhuber, N., Carle, G.: The SSL landscape: a thorough analysis of the X.509 PKI using active and passive measurements. In: Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC ’11, pp. 427–444. ACM (2011)
21. Intel Corporation. Using streaming SIMD extensions (SSE2) to perform big multiplications. Whitepaper AP-941 (2000). http://software.intel.com/file/24960
22. Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual (Combined Volumes 1, 2A, 2B, 2C, 3A, 3B and 3C) (2013). http://download.intel.com/products/processor/manual/325462.pdf
23. Iwamura, K., Matsumoto, T., Imai, H.: Systolic-arrays for modular exponentiation using Montgomery method. In: Rueppel, R.A. (ed.) EUROCRYPT 1992. LNCS, vol. 658, pp. 477–481. Springer, Heidelberg (1993)
24. Kaihara, M.E., Takagi, N.: Bipartite modular multiplication method. IEEE Trans. Comput. 57(2), 157–164 (2008)
25. Karatsuba, A.A., Ofman, Y.: Multiplication of many-digital numbers by automatic computers. Proc. USSR Acad. Sci. 145, 293–294 (1962)
26. Koc, K., Acar, T., Kaliski Jr, B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 16(3), 26–33 (1996)
27. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp. 104–113. Springer, Heidelberg (1996)
28. Lenstra, A.K., Hughes, J.P., Augier, M., Bos, J.W., Kleinjung, T., Wachter, C.: Public keys. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol. 7417, pp. 626–642. Springer, Heidelberg (2012)
29. Merrill, R.D.: Improving digital computer performance using residue number theory. IEEE Trans. Electron. Comput. EC-13(2), 93–101 (1964)
30. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)
31. National Institute of Standards and Technology. Special publication 800-57: Recommendation for key management part 1: General (revision 3). http://csrc.nist.gov/publications/nistpubs/80057/sp80057_part1_rev3_general.pdf
32. OpenSSL. The open source toolkit for SSL/TLS (2013)
33. Page, D., Smart, N.P.: Parallel cryptographic arithmetic using a redundant Montgomery representation. IEEE Trans. Comput. 53(11), 1474–1482 (2004)
34. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21, 120–126 (1978)
35. Sánchez, A.H., Rodríguez-Henríquez, F.: NEON implementation of an attribute-based encryption scheme. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 322–338. Springer, Heidelberg (2013)
36. U.S. Department of Commerce/National Institute of Standards and Technology. Digital Signature Standard (DSS). FIPS 186-3 (2009). http://csrc.nist.gov/publications/fips/fips1863/fips_1863.pdf