Four\(\mathbb {Q}\): Four-Dimensional Decompositions on a \(\mathbb {Q}\)-curve over the Mersenne Prime
Abstract
We introduce Four\(\mathbb {Q}\), a high-security, high-performance elliptic curve that targets the 128-bit security level. At the highest arithmetic level, cryptographic scalar multiplications on Four\(\mathbb {Q}\) can use a four-dimensional Gallant-Lambert-Vanstone decomposition to minimize the total number of elliptic curve group operations. At the group arithmetic level, Four\(\mathbb {Q}\) admits the use of extended twisted Edwards coordinates and can therefore exploit the fastest known elliptic curve addition formulas over large prime characteristic fields. Finally, at the finite field level, arithmetic is performed modulo the extremely fast Mersenne prime \(p=2^{127}-1\). We show that this powerful combination facilitates scalar multiplications that are significantly faster than all prior works. On Intel’s Haswell, Ivy Bridge and Sandy Bridge architectures, our software computes a variable-base scalar multiplication in 59,000, 71,000 and 74,000 cycles, respectively; and, on the same platforms, our software computes a Diffie-Hellman shared secret in 92,000, 110,000 and 116,000 cycles, respectively.
Keywords
Shared Secret, Scalar Multiplication, Main Loop, Elliptic Curve Cryptography, Operation Count
1 Introduction
This paper introduces a new, complete twisted Edwards [5] curve \(\mathcal {E}(\mathbb {F}_{p^2}):\,-x^2+y^2=1+dx^2y^2\), where p is the Mersenne prime \(p=2^{127}-1\), and d is a non-square in \(\mathbb {F}_{p^2}\). This curve, dubbed “Four\(\mathbb {Q}\)”, arises as a special instance of recent constructions using \(\mathbb {Q}\)-curves [27, 46], and is thus equipped with an endomorphism \(\psi \) related to the p-power Frobenius map. In addition, it has complex multiplication (CM) by the order of discriminant \(D=-40\), meaning it comes equipped with another efficient, low-degree endomorphism \(\phi \) [47].
We built an elliptic curve cryptography (ECC) library that works inside the cryptographic subgroup \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), where N is a 246-bit prime. The endomorphisms \(\psi \) and \(\phi \) do not give any practical speedup to Pollard’s rho algorithm [42], which means the best known attack against the elliptic curve discrete logarithm problem (ECDLP) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) requires around \(\sqrt{\pi N/4} \sim 2^{122.5}\) group operations on average. Thus, the cryptographic security of \(\mathcal {E}\) (see Sect. 2.3 for more details) is closely comparable to other curves that target the 128-bit security level, e.g., [6, 9, 21, 37].

Speed: Four\({\mathbb {Q}}\)’s library computes scalar multiplications significantly faster than all known software implementations of curve-based cryptographic primitives. It uses the endomorphisms \(\psi \) and \(\phi \) to accelerate scalar multiplications via four-dimensional Gallant-Lambert-Vanstone (GLV)-style [22] decompositions. Four-dimensional decompositions have been used before [9, 32, 37], but not over the Mersenne prime^{1}; this choice of field is significantly faster than any neighboring fields and several works have studied its arithmetic [13, 21, 36]. The combination of extremely fast modular reductions and four-dimensional scalar decompositions makes for highly efficient scalar multiplications on \(\mathcal {E}\). Furthermore, we can exploit the fastest known addition formulas for elliptic curves over large characteristic fields [31], which are complete on \(\mathcal {E}\) since the above d is non-square [31, Sect. 3]. In Sect. 2, we explain why four-dimensional decompositions and this special underlying field were not previously partnered at the 128-bit security level.
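To illustrate why reduction modulo the Mersenne prime is cheap, the following Python sketch folds the high bits of an integer onto its low 127 bits, using only the identity \(2^{127} \equiv 1 \pmod p\). The names `reduce_mersenne` and `mulmod` are illustrative assumptions and are not taken from the Four\(\mathbb {Q}\) library; the sketch makes no attempt to run in constant time.

```python
P127 = (1 << 127) - 1  # the Mersenne prime p = 2^127 - 1


def reduce_mersenne(x: int) -> int:
    """Reduce a non-negative integer modulo p using shifts, masks and adds only."""
    # Because 2^127 = 1 (mod p), the bits above position 126 can be
    # added back onto the low 127 bits without changing the residue.
    while x >> 127:
        x = (x & P127) + (x >> 127)
    # Now 0 <= x <= p; the value p itself is a second representation of 0.
    return 0 if x == P127 else x


def mulmod(a: int, b: int) -> int:
    """Multiply two residues and reduce the double-length product."""
    return reduce_mersenne(a * b)
```

No division or trial subtraction by a general modulus is needed, which is the property that makes this field attractive in practice.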

Simplicity and concrete correctness: Simplicity is a major priority in this work and in the development of our software; in some cases we sacrifice speed enhancements in order to design a simpler and more compact algorithm (cf. Sect. 4.2).
On input of any point \(P \in \mathcal {E}(\mathbb {F}_{p^2})[N]\), validated as in [14, Appendix A] if necessary, and any integer scalar \(m \in [0,2^{256})\), our software does the following (strictly in constant time and without exception):
1. Computes \(\phi (P)\), \(\psi (P)\) and \(\psi (\phi (P))\) using exactly^{2} \(68 \mathbf{M}\), \(27 \mathbf{S}\) and \(49.5 \mathbf{A}\) – see Sect. 3.
2. Decomposes m (e.g., in less than 200 Sandy Bridge cycles) into a multiscalar \((a_1,a_2,a_3,a_4) \in \mathbb {Z}^4\) such that each \(a_i\) is positive and at most 64 bits – see Sect. 4.
3. Recodes the multiscalar (e.g., in less than 800 Sandy Bridge cycles) to ensure a simple and constant-time main loop – see Sect. 5.
4. Computes a lookup table of 8 elements using exactly 7 complete additions, before executing the main loop using exactly 64 complete twisted Edwards double-and-add operations, and finally outputting \([m]P = [a_1]P+[a_2]\phi (P)+[a_3]\psi (P) + [a_4]\psi \phi (P)\) – see Sect. 5.
This paper details each of the above steps explicitly, culminating in the full routine presented in Algorithm 2. Several prior works exploiting scalar decompositions have potential points of failure (cf. [30, Sect. 7], and Sect. 4.2), but crucially, and for the first time in the setting of four-dimensional decompositions, we accompany our routine with a robust proof of correctness – see Theorem 1.

Cryptographic versatility: Four\(\mathbb {Q}\) is intended to be used in the same way, i.e., using the same model, same coordinates and same explicit formulas, irrespective of the cryptographic protocol or nature of the intended scalar multiplication. Unlike implementations using ladders [4, 6, 9, 23], Four\(\mathbb {Q}\) supports fast variable-base and fast fixed-base scalar multiplications, both of which use twisted Edwards coordinates; this serves as a basis for fast (ephemeral) Diffie-Hellman key exchange and fast Schnorr-like signatures. The presence of a single, complete addition law gives implementers the ability to easily wrap higher-level software and protocols around Four\(\mathbb {Q}\)’s library exactly as is.

Public availability: Prior works exploiting four-dimensional decompositions have either made code available that did not attempt to run in constant time [9], or not published code that did run in constant time [18, 37]. Our library, which is publicly available [15], is largely written in portable C and includes two modular implementations of the arithmetic over \(\mathbb {F}_{p^2}\): a portable implementation written in C and a high-performance implementation for x64 platforms written in C and optional x64 assembly. The library also allows the user to select (at build time) whether or not the efficiently computable endomorphisms \(\psi \) and \(\phi \) are used for computing generic scalar multiplications. The code is accompanied by Magma scripts that can be used to verify the proofs of all claims and the claimed operation counts. Our aim is to make it easy for subsequent implementers to replicate the routine and, if desired, develop specialized code that is tailored to specific platforms for further performance gains or with different memory constraints.
When the NIST curves [40] were standardized in 1999, many of the landmark discoveries in ECC (e.g., [17, 21, 22, 46]) were yet to be made. Four\(\mathbb {Q}\) and its accompanying library represent the culmination of several of the best known ECC optimizations to date: it pulls together the extremely fast Mersenne prime, the fastest known large-characteristic addition formulas [31], and the highest degree of scalar decompositions (there is currently no known way of achieving higher-dimensional decompositions without exposing the ECDLP to attacks that are asymptotically much faster than Pollard rho). Consequently, for generic scalar multiplications, Four\(\mathbb {Q}\) performs around four to five times faster than the original NIST P-256 curve [26], between two and three times faster than curves that are currently under consideration as NIST alternatives, e.g., Curve25519 [4], and is also significantly faster than all of the other curves used to set previous speed records (see Sect. 6 for the comparisons). Interestingly, Four\(\mathbb {Q}\) is still highly efficient if the endomorphisms \(\psi \) and \(\phi \) are not used at all for computing generic scalar multiplications. In this case, Four\(\mathbb {Q}\) performs about three times faster than the NIST P-256 curve and up to 1.5 times faster than Curve25519.
It is our belief that the demand for high-performance cryptography warrants the state-of-the-art in ECC to be part of the standardization discussion: this paper ultimately demonstrates the performance gains that are possible if such a curve were to be considered alongside the “conservative” choices.
The extended version. For space considerations, we have omitted the proofs of Propositions 1, 2, 4 and 5, Lemma 1 and Theorem 1, as well as several additional remarks. All of these, along with an appendix covering point validation, can be found in the extended version of this article [14].
2 The Curve: Four\(\mathbb {Q}\)
This section describes the proposed curve, where we adopt Smith’s notation [44, 46] for the most part. We present the curve parameters in Sect. 2.1, shed some light on how the curve was found in Sect. 2.2, and discuss its cryptographic security in Sect. 2.3. Sects. 2.2 and 2.3 both explain that \(\mathcal {E}\) is essentially one-of-a-kind, illustrating that there were no degrees of freedom in the choice of curve (see [14] for more details).
2.1 A Complete Twisted Edwards Curve
The set of \(\mathbb {F}_{p^2}\)-rational points satisfying the affine model for \(\mathcal {E}\) forms a group: the neutral element is \(\mathcal {O}_{\mathcal {E}} = (0,1)\) and the inverse of a point (x, y) is \((-x,y)\). The fastest set of explicit formulas for the addition law on \(\mathcal {E}\) is due to Hisil, Wong, Carter and Dawson [31]: they use extended twisted Edwards coordinates to represent the affine point (x, y) on \(\mathcal {E}\) by any projective tuple of the form \((X :Y :Z :T)\) for which \(Z \ne 0\), \(x = X/Z\), \(y=Y/Z\) and \(T=XY/Z\). Since d is not a square in \(\mathbb {F}_{p^2}\), this set of formulas is also complete on \(\mathcal {E}\) (see [5]), meaning that they will work without exception for all points in \(\mathcal {E}(\mathbb {F}_{p^2})\).
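The meaning of completeness can be checked exhaustively on a toy example. The sketch below is not Four\(\mathbb {Q}\): it assumes a hypothetical curve \(-x^2+y^2=1+2x^2y^2\) over the small prime field \(\mathbb {F}_{13}\), chosen only because \(-1\) is a square and 2 is a non-square modulo 13, and it uses the affine twisted Edwards addition law with \(a=-1\) rather than the extended-coordinate formulas of [31].

```python
Q = 13        # toy prime with Q = 1 (mod 4), so a = -1 is a square mod Q
A = -1 % Q    # twisted Edwards coefficient a = -1
D = 2         # a non-square modulo 13 (assumed toy parameter)


def on_curve(pt):
    """Check a*x^2 + y^2 = 1 + d*x^2*y^2 over F_Q."""
    x, y = pt
    return (A * x * x + y * y - 1 - D * x * x * y * y) % Q == 0


def add(p1, p2):
    """Affine twisted Edwards addition; valid for every pair of curve points."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % Q
    # Completeness means neither denominator 1 + t nor 1 - t is ever zero.
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % Q, -1, Q) % Q
    y3 = (y1 * y2 - A * x1 * x2) * pow((1 - t) % Q, -1, Q) % Q
    return (x3, y3)


# All affine points of the toy curve.
POINTS = [(x, y) for x in range(Q) for y in range(Q) if on_curve((x, y))]
```

Looping `add` over every pair in `POINTS` never raises an inversion error, and `add(P, (-x, y))` returns the neutral element `(0, 1)` for every `P = (x, y)`, mirroring the statements above.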
2.2 Where did this Curve Come From?
The curve \(\mathcal {E}\) above comes from the family of \(\mathbb {Q}\)-curves of degree 2 – originally defined by Hasegawa [29] – that was recently used as one of the example families in Smith’s general construction of \(\mathbb {Q}\)-curve endomorphisms [44, 46]. Certain examples of low-degree \(\mathbb {Q}\)-curves (including this family) were independently obtained through a different construction by Guillevic and Ionica [27], who also studied 4-dimensional decompositions arising from such curves possessing CM. In fact, \(\mathcal {E}\) has a similar structure to the curve constructed in [27, Exercise 1], but is over the prime \(p=2^{127}-1\).
Recall that in this paper we fix \(p=2^{127}-1\) for efficiency reasons. For this particular prime p and this family of \(\mathbb {Q}\)-curves, Smith’s construction gives rise to precisely p non-isomorphic curves corresponding to each possible choice of \(s \in \mathbb {F}_p\) [46, Proposition 1]. Varying s allows us to readily find curves belonging to this family with strong cryptographic group orders, each of which comes equipped with the endomorphism \(\psi \) that facilitates a two-dimensional scalar decomposition.
Proposition 1
Let \(\hat{\mathcal {E}}/K\) and \(\mathcal {E}/K\) be the twisted Edwards curves defined by \(\hat{\mathcal {E}}/K :-x^2+y^2 = 1+\hat{d}x^2y^2\) and \(\mathcal {E}/K :-x^2+y^2 = 1+dx^2y^2\). If \(d = -(1+1/\hat{d})\), then the map \(\tau \, :\, \mathcal {E}\rightarrow \hat{\mathcal {E}}\), \((x,y) \mapsto \left( \frac{2xy}{(x^2+y^2)\sqrt{\hat{d}}} \, , \, \frac{x^2-y^2+2}{y^2-x^2} \right) \) is a 4-isogeny, the dual of which is \(\hat{\tau } \, :\, \hat{\mathcal {E}} \rightarrow \mathcal {E}\), \((x,y) \mapsto \left( \frac{2xy\sqrt{\hat{d}}}{x^2-y^2+2} \, , \, \frac{y^2-x^2}{y^2+x^2} \right) \).
We note at once that if \(\hat{d}\) is a square in K, then \(\tau \) and \(\hat{\tau }\) are defined over K. Fortunately, while the twisted Edwards curve \(\hat{\mathcal {E}}\) corresponding to \(\mathcal {E}_\mathrm{W}/\mathbb {F}_{p^2}\) has a square constant \(\hat{d}\), our chosen isogenous curve \(\mathcal {E}\) has the non-square constant \(d = -(1+1/\hat{d})\). Our implementation will work solely in twisted Edwards coordinates on \(\mathcal {E}\), but we will pass back and forth through \(\mathcal {E}_\mathrm{W}\) (via \(\hat{\mathcal {E}}\)) when deriving explicit formulas for the endomorphisms \(\phi \) and \(\psi \) in Sect. 3. We note that Hamburg used 4-isogenies (also derived from [1]) to a similar effect in [28].
2.3 The Cryptographic Security of Four\(\mathbb {Q}\)
Pollard’s rho algorithm [42] is the best known way to solve the ECDLP in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\). An optimized version of this attack which uses the negation map [50] requires around \(\sqrt{\pi N/4} \sim 2^{122.5}\) group operations on average. We note that, unlike some of the typical GLV [22] or GLS [21] endomorphisms that can be used to speed up Pollard’s rho algorithm [16], neither \(\psi \) nor \(\phi \) on \(\mathcal {E}\) facilitates any known advantage; neither of these endomorphisms has a small orbit and they are both more expensive to compute than an amortized addition. Thus, the known complexity of the ECDLP on \(\mathcal {E}\) is comparable to various other curves used in the speed-record literature; optimized implementations of Pollard rho against any of the fastest curves in [4, 9, 13, 18, 21, 37, 41] would require between \(2^{124.8}\) and \(2^{125.8}\) group operations on average. Ideally, we would prefer not to have the factor \(7^2\) dividing \(\#\mathcal {E}(\mathbb {F}_{p^2})\), but the resulting (\(\sim 2.8\) bit) security degradation is a small price to pay for having the fastest field at the 128-bit level in conjunction with a four-dimensional scalar decomposition. As we discuss further in [14], it was a long shot to try to find such a cryptographically secure \(\mathbb {Q}\)-curve with CM over \(\mathbb {F}_{p^2}\) in Smith’s tables in the first place, let alone one that also had the necessary torsion to support a twisted Edwards model.
Since \(\mathcal {E}(\mathbb {F}_{p^2})\) has rational 2-torsion, it is easy to write down the corresponding abelian surface over \(\mathbb {F}_p\) whose Jacobian is isogenous to the Weil restriction of \(\mathcal {E}\) – see [43, Lemma 2.1 and Lemma 3.1]. But since the best known algorithm to solve the discrete logarithm problem on such abelian surfaces is again Pollard’s rho algorithm, the Weil descent philosophy (cf. [24]) does not pose a threat here. Furthermore, the embedding degree of \(\mathcal {E}\) with respect to N is \((N-1)/2\), making it infeasible to reduce the ECDLP into a finite field [19, 39].
We note that the largest prime factor dividing the group order of \(\mathcal {E}\)’s quadratic twist is 158 bits, but twist-security [4] is not an issue in this work: firstly, our software always validates input points (such validation is essentially free), and secondly, x-coordinate-only arithmetic (which is where twist-security makes sense) on \(\mathcal {E}\) is not competitive with a four-dimensional decomposition that uses both coordinates.
In contrast to most currently standardized curves, the proposed curve is both defined over a quadratic extension field and has a small discriminant; one notable exception is secp256k1 in the SEC standard [11], which is used in the Bitcoin protocol and also has small discriminant. However, it is important to note that there is no better-than-generic attack known to date that can exploit either of these two properties on \(\mathcal {E}\). In fact, with respect to ECDLP difficulty, Koblitz, Koblitz and Menezes [33, Sect. 11] point out that slower, large-discriminant curves, like NIST P-256 and Curve25519, may turn out to be less conservative than specially chosen curves with small discriminant.
3 The Endomorphisms \(\psi \) and \(\phi \)
In this section we derive explicit formulas for the two endomorphisms on \(\mathcal {E}\). In what follows we use \(c_{i,j,k,l}\) to denote the constant \(i+j\sqrt{2}+k\sqrt{5}+l\sqrt{2}\sqrt{5}\) in \(\mathbb {F}_{p^2}\), which is fixed by setting \(\sqrt{2}:=2^{64}\) and \(\sqrt{5}:=87392807087336976318005368820707244464 \cdot i\).
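The choice \(\sqrt{2}:=2^{64}\) can be checked directly, since \((2^{64})^2 = 2^{128} = 2\cdot 2^{127} \equiv 2 \pmod p\). The Python sketch below performs this check and assembles a constant \(c_{i,j,k,l}\) as a pair (real part, imaginary part). It assumes that \(\mathbb {F}_{p^2}\) is represented as \(\mathbb {F}_p(i)\) with \(i^2=-1\), which is possible because \(p \equiv 3 \pmod 4\); the helper names `fp2_mul`, `squares_to` and `c_const` are hypothetical.

```python
P = (1 << 127) - 1

SQRT2 = 1 << 64                                      # sqrt(2) := 2^64
SQRT5_IMAG = 87392807087336976318005368820707244464  # coefficient of i, copied from the text


def fp2_mul(u, v):
    """Multiply u0 + u1*i by v0 + v1*i in F_p[i]/(i^2 + 1) (assumed representation)."""
    u0, u1 = u
    v0, v1 = v
    return ((u0 * v0 - u1 * v1) % P, (u0 * v1 + u1 * v0) % P)


def squares_to(u, n):
    """Return True if u^2 equals the integer n in F_{p^2}."""
    return fp2_mul(u, u) == (n % P, 0)


def c_const(i, j, k, l):
    """Assemble c_{i,j,k,l} = i + j*sqrt2 + k*sqrt5 + l*sqrt2*sqrt5."""
    sqrt2 = (SQRT2, 0)
    sqrt5 = (0, SQRT5_IMAG)
    sqrt10 = fp2_mul(sqrt2, sqrt5)
    real = (i + j * sqrt2[0] + k * sqrt5[0] + l * sqrt10[0]) % P
    imag = (j * sqrt2[1] + k * sqrt5[1] + l * sqrt10[1]) % P
    return (real, imag)
```

For instance, `squares_to((SQRT2, 0), 2)` confirms the first constant, and the analogous call `squares_to((0, SQRT5_IMAG), 5)` can be used to test the second.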
For both \(\psi \) and \(\phi \), we start by deriving the explicit formulas on the short Weierstrass model \(\mathcal {E}_\mathrm{W}\). As discussed in the previous section, we will pass back and forth between \(\mathcal {E}\) and \(\mathcal {E}_\mathrm{W}\) via the twisted Edwards curve \(\hat{\mathcal {E}}\) that is 4-isogenous to \(\mathcal {E}\) over \(\mathbb {F}_{p^2}\). The maps between \(\mathcal {E}\) and \(\hat{\mathcal {E}}\) are given in Proposition 1, and we take the maps \(\delta :\mathcal {E}_\mathrm{W} \rightarrow \hat{\mathcal {E}}\) and \(\delta ^{-1} :\hat{\mathcal {E}} \rightarrow \mathcal {E}_\mathrm{W}\) from [46, Sect. 5] (tailored to our \(\hat{\mathcal {E}}\)) as \(\delta \, :(x,y) \mapsto \left( \frac{\gamma (x-4)}{y},\frac{x-4-c_{0,2,0,1}}{x-4+c_{0,2,0,1}}\right) \), and \(\delta ^{-1} \, :(x,y) \mapsto \left( \frac{c_{0,2,0,1}(y+1)}{1-y}+4,\frac{ c_{0,2,0,1}(y+1)\gamma }{x(1-y)}\right) \), where \(\gamma ^2=c_{-12,-4,0,-2}\). The choice of the square root \(\gamma \in \mathbb {F}_{p^2}\) becomes irrelevant in the compositions below.
3.1 Explicit Formulas for \(\psi \)
There is almost no work to be done in deriving \(\psi \) on \(\mathcal {E}\), since this is Smith’s \(\mathbb {Q}\)-curve endomorphism corresponding to the degree-2 family to which \(\mathcal {E}_\mathrm{W}\) belongs. We start with \(\psi _\mathrm{W} :\mathcal {E}_\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W}\), taken from [46, Sect. 5], as \(\psi _\mathrm{W} :(x,y) \mapsto \left( \left( -\frac{x}{2}-\frac{c_{9,0,4,0}}{x-4}\right) ^p, \left( \frac{y}{i\sqrt{2}} \left( -\frac{1}{2}+\frac{c_{9,0,4,0}}{(x-4)^2}\right) \right) ^p\right) \). With \(\psi _\mathrm{W}\) as above, we define \(\psi :\mathcal {E}\rightarrow \mathcal {E}\) as the composition \(\psi = \hat{\tau }\delta \psi _\mathrm{W} \delta ^{-1} \tau \). In optimizing the explicit formulas for this composition, there is practically nothing to be gained by simplifying the full composition in the function field \(\mathbb {F}_{p^2}(\mathcal {E})\). However, it is advantageous to optimize explicit formulas for the inner composition \((\delta \psi _\mathrm{W} \delta ^{-1})\) in the function field \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\). In fact, for both \(\psi \) and \(\phi \), optimized explicit formulas for this inner composition are faster than the respective endomorphisms \(\psi _\mathrm{W}\) and \(\phi _\mathrm{W}\), and are therefore much faster than computing the respective compositions individually.
3.2 Deriving Explicit Formulas for \(\phi \)
We now derive the second endomorphism \(\phi \) that arises from \(\mathcal {E}\) admitting CM by the order of discriminant \(D=-40\). We start by pointing out that there are actually multiple routes that could be taken in defining and deriving \(\phi \) (see the full version [14] for additional details). The possibility that we use in this paper produces an endomorphism of degree 5. This option was revealed to us in correspondence with Ben Smith, who pointed out that \(\mathbb {Q}\)-curves with CM can also be produced as the intersection of families of \(\mathbb {Q}\)-curves, and that our curve \(\mathcal {E}\) is not only a degree-2 \(\mathbb {Q}\)-curve, but is also a degree-5 \(\mathbb {Q}\)-curve. Thus, the second endomorphism \(\phi \) can be derived by first following the treatment in [46, Sect. 7] (see also [27, Sect. 3.3]) to derive \(\phi _\mathrm{W}\) as a 5-isogeny on \(\mathcal {E}_\mathrm{W}\), which we do below.
3.3 Eigenvalues
The eigenvalues of the two endomorphisms \(\psi \) and \(\phi \) play a key role in developing scalar decompositions. In this subsection we write them in terms of the curve parameters. From [46, Theorem 2], and given that we used a 4-isogeny \(\tau \) and its dual to pass back and forth to \(\mathcal {E}_\mathrm{W}\), the eigenvalues of \(\psi \) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) are \(\lambda _{\psi } := 4 \cdot \frac{p+1}{r} \pmod {N}\) and \(\lambda _{\psi }' := -\lambda _{\psi } \pmod {N}\), where r is an integer satisfying \(2r^2 = 2p + t_{\mathcal {E}}\). To derive the eigenvalues for \(\phi \), we make use of the CM equation for \(\mathcal {E}\), which (since \(\mathcal {E}\) has CM by the order of discriminant \(D=-40\)) is \(40V^2 = 4p^2-t_\mathcal {E}^2\), for some integer V. We fix r and V to be the positive integers satisfying these equations, namely \(V:=4929397548930634471175140323270296814\) and \(r:=15437785290780909242\).
Proposition 2
3.4 Section Summary
Table 1 summarizes the isogenies derived in this section, together with their exact operation counts. The reason that multiples of 0.5 appear in the additions column is that we count Frobenius operations (which amount to a negation in \(\mathbb {F}_p\)) as half an addition in \(\mathbb {F}_{p^2}\). Four-dimensional scalar decompositions on \(\mathcal {E}\) require the computation of \(\phi (P)\), \(\psi (P)\) and the composition \(\psi (\phi (P))\); the ordering here is important since \(\psi \) is much faster than \(\phi \), meaning we actually compute \(\phi \) once and \(\psi \) twice. We note that all sets of explicit formulas were derived assuming the inputs were projective points \((X :Y :Z)\) corresponding to a point \((X/Z, Y/Z)\) in the domain of the isogeny. Similarly, all explicit formulas output the point \((X' :Y' :Z')\) corresponding to \((X'/Z',Y'/Z')\) in the codomain, and in the special cases when the codomain is \(\mathcal {E}\) (i.e., for \(\hat{\tau }\), \(\phi \), \(\psi \) and \(\psi \phi \)), we also output the coordinate \(T'\) (or a related variant) corresponding to \(T'=X'Y'/Z'\), which facilitates faster subsequent group law formulas on \(\mathcal {E}\) – see [14].
Table 1. Summary of isogenies used in the derivation of the three endomorphisms \(\phi \), \(\psi \) and \(\psi \phi \) on \(\mathcal {E}\), together with the cost of their explicit formulas. Here \(\mathbf{M}\), \(\mathbf{S}\) and \(\mathbf{A}\) respectively denote the costs of one multiplication, one squaring and one addition in \(\mathbb {F}_{p^2}\).

| Isogeny | Domain & codomain | Degree | No. fixed constants | No. temp variables | Cost (M) | Cost (S) | Cost (A) |
|---|---|---|---|---|---|---|---|
| \(\tau \) | \(\mathcal {E}\rightarrow \hat{\mathcal {E}}\) | 4 | 1 | 2 | 5 | 3 | 5 |
| \(\hat{\tau }\) | \(\hat{\mathcal {E}} \rightarrow \mathcal {E}\) | 4 | 1 | 2 | 5 | 3 | 4 |
| \((\delta \phi _\mathrm{W}\delta ^{-1})\) | \(\hat{\mathcal {E}} \rightarrow \hat{\mathcal {E}}\) | 5p | 10 | 7 | 20 | 5 | 11.5 |
| \((\delta \psi _\mathrm{W}\delta ^{-1})\) | \(\hat{\mathcal {E}} \rightarrow \hat{\mathcal {E}}\) | 2p | 4 | 2 | 9 | 2 | 5.5 |
| \(\phi \) | \(\mathcal {E}\rightarrow \mathcal {E}\) | 80p | 11 | 7 | 30 | 11 | 20.5 |
| \(\psi \) | \(\mathcal {E}\rightarrow \mathcal {E}\) | 32p | 5 | 2 | 19 | 8 | 14.5 |
| \(\psi \phi \) | \(\mathcal {E}\rightarrow \mathcal {E}\) | 2560p | - | 7 | 19 | 8 | 14.5 |
| total cost (\(\phi \), \(\psi \), \(\psi \phi \)) | | | 16 | 7 | 68 | 27 | 49.5 |
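The composite rows of Table 1 follow from the component rows by simple addition, since \(\phi \) and \(\psi \) are each computed as \(\hat{\tau }\), an inner composition and \(\tau \), and the total evaluates \(\phi \) once and \(\psi \) twice. The short Python sketch below reproduces this bookkeeping; the dictionary keys are hypothetical labels for the table rows.

```python
# (M, S, A) costs of the component maps, copied from Table 1.
COSTS = {
    "tau": (5, 3, 5.0),
    "tau_hat": (5, 3, 4.0),
    "inner_phi": (20, 5, 11.5),  # delta * phi_W * delta^{-1}
    "inner_psi": (9, 2, 5.5),    # delta * psi_W * delta^{-1}
}


def compose(*names):
    """Add the (M, S, A) costs of the named component maps."""
    return tuple(sum(COSTS[name][k] for name in names) for k in range(3))


phi_cost = compose("tau", "inner_phi", "tau_hat")
psi_cost = compose("tau", "inner_psi", "tau_hat")

# phi(P), psi(P) and psi(phi(P)) need one evaluation of phi and two of psi.
total_cost = tuple(a + 2 * b for a, b in zip(phi_cost, psi_cost))
```

With these inputs, `phi_cost`, `psi_cost` and `total_cost` agree with the \(\phi \), \(\psi \) and total rows of the table.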
4 Optimal Scalar Decompositions
Let \(\lambda _\psi \) and \(\lambda _\phi \) be as fixed in Sect. 3.3. In this section we show how to compute, for any integer scalar \(m \in \mathbb {Z}\), a corresponding 4-dimensional multiscalar \((a_1,a_2,a_3,a_4) \in \mathbb {Z}^4\) such that \(m \equiv a_1+a_2\lambda _\phi +a_3\lambda _\psi +a_4\lambda _\phi \lambda _\psi \pmod {N}\), such that \(0\le a_i<2^{64}-1\) for \(i=1,2,3,4\), and such that \(a_1\) is odd (which facilitates faster scalar recodings and multiplications – see Sect. 5). An excellent reference for general scalar decompositions in the context of elliptic curve cryptography is [45], where it is shown how to write down short lattice bases for scalar decompositions directly from the curve parameters. Here, we show how to further reduce such short bases into bases that are, in the context of multiscalar multiplications, optimal.
4.1 Babai Rounding and Optimal Bases
Since our scalar multiplications must run in time independent of m, the speed of the multiscalar exponentiations will depend on the worst case, i.e., on the maximal infinity norm taken across all elements in \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). Or, equivalently, the speed of the routine will depend on the width of the smallest 4-cube whose convex body contains \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). This width depends only on the choice of \(\mathbf{B}\), so this gives us a natural way of finding a basis that is optimal for our purposes. We make this concrete in the following definition, which is stated for an arbitrary lattice of dimension n. Definition 1 simplifies the situation by looking for the smallest n-cube containing \(\mathcal {P}(\mathbf{B})\), rather than \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^n\), but our candidate bases will always be orthogonal enough that the conditions are equivalent in practice.
Definition 1
(Babai-optimal bases). We say that a basis \(\mathbf{B}\) of a lattice \(\mathcal {L}\subset \mathbb {R}^n\) is Babai-optimal if the width of the smallest n-cube containing the parallelepiped \(\mathcal {P}(\mathbf{B})\) is minimal across all bases for \(\mathcal {L}\).
We note immediately that taking the n successive minima under \(\Vert \cdot \Vert _\ell \), for any \(\ell \in \{1,2,\dots ,\infty \}\), will not be Babai-optimal in general. Indeed, for our specific lattice \(\mathcal {L}\), neither the \(\Vert \cdot \Vert _2\)-reduced basis (output from LLL [35]) nor the \(\Vert \cdot \Vert _\infty \)-reduced basis (in the sense of Lovász and Scarf [38]) is Babai-optimal.
For very low dimensions, such as those used in ECC scalar decompositions, we can find a Babai-optimal basis via straightforward enumeration as follows. Starting with any reasonably small basis \(\mathbf{B}'=(\mathbf {b}_1',\dots ,\mathbf {b}_n')\), like the ones in [45], we compute the width, \(w(\mathbf{B}')\), of the smallest n-cube whose convex body contains \(\mathcal {P}(\mathbf{B}')\); by the definition of \(\mathcal {P}\), this is \(w(\mathbf{B}') = \mathrm{max}_{1 \le j \le n}\left\{ \sum _{i=1}^n |\mathbf {b}'_i[j]| \right\} \). We then enumerate the set S of all vectors \(\mathbf{v} \in \mathcal {L}\) such that \(\Vert \mathbf{v}\Vert _\infty \le w(\mathbf{B}')\); no vector outside S can belong to a basis whose width is smaller than that of \(\mathbf{B}'\), because the width of a basis is at least the infinity norm of each of its vectors. We can then test all possible bases \(\mathbf{B}\) that are formed as combinations of n linearly independent vectors in S, and choose one corresponding to the minimal value of \(w(\mathbf{B})\).
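The enumeration just described can be sketched in a few lines of Python for a two-dimensional toy lattice. Everything in this example is an assumption made for illustration: the lattice \(\{(a_1,a_2) : a_1+10a_2 \equiv 0 \pmod {101}\}\), the starting basis, and the function names are hypothetical and unrelated to the lattice used for Four\(\mathbb {Q}\).

```python
from itertools import combinations, product


def width(basis):
    """w(B) = max_j sum_i |b_i[j]|, the side of the smallest cube containing P(B)."""
    n = len(basis[0])
    return max(sum(abs(b[j]) for b in basis) for j in range(n))


def det2(u, v):
    """Determinant of the 2x2 matrix with rows u and v."""
    return u[0] * v[1] - u[1] * v[0]


def in_lattice(v, basis):
    """Cramer's rule: v is in the lattice iff both coordinates w.r.t. the basis are integers."""
    b1, b2 = basis
    d = det2(b1, b2)
    return det2(v, b2) % d == 0 and det2(b1, v) % d == 0


def babai_optimal_2d(start):
    """Exhaustive search for a minimal-width basis of the lattice spanned by `start`."""
    w = width(start)
    vol = abs(det2(*start))
    box = range(-w, w + 1)
    # S: nonzero lattice vectors with infinity norm at most w(start).
    S = [v for v in product(box, box) if v != (0, 0) and in_lattice(v, start)]
    # A pair of lattice vectors is a basis exactly when its determinant is +-vol.
    candidates = [pair for pair in combinations(S, 2) if abs(det2(*pair)) == vol]
    return min(candidates, key=width)


best = babai_optimal_2d([(-10, 1), (-9, 11)])
```

Here the starting basis has width 19, and the search returns a basis of width 11 built from the vectors \(\pm (-10,1)\) and \(\pm (1,10)\). In dimension 4 the same idea applies, with the determinant test taken over 4-tuples of vectors from S.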
Proposition 3
Proof
Straightforward but lengthy calculations using the equations in Sect. 3.3 reveal that \(\mathbf {b}_1\), \(\mathbf {b}_2\), \(\mathbf {b}_3\) and \(\mathbf {b}_4\) are all in \(\mathcal {L}\). Another direct calculation reveals that the determinant of \(\langle \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 \rangle \) is N, so \(\mathbf{B}\) is a basis for \(\mathcal {L}\). To show that \(\mathbf{B}\) is Babai-optimal, we set \(\mathbf{B}'=\mathbf{B}\) and compute \(w(\mathbf{B}') = \mathrm{max}_{1 \le j \le 4}\left\{ \sum _{i=1}^4 |\mathbf {b}'_i[j]| \right\} \), which (at \(j=1\)) is \(w(\mathbf{B}')= (245\alpha +120r+17)/448\). Enumeration under \(\Vert \cdot \Vert _\infty \) yields exactly 128 vectors (up to sign) in \(S = \{\mathbf{v} \in \mathcal {L}\mid \Vert \mathbf{v}\Vert _\infty \le w(\mathbf{B}') \}\); none of the rank-4 bases formed from S have a width smaller than \(\mathbf{B}\). \(\square \)
The size of the set S in the above proof depends on the quality of the initial basis \(\mathbf{B}'\). For the proof, it suffices to start with the Babai-optimal basis \(\mathbf{B}\) itself, but in practice we will usually start with a basis that is not optimal according to Definition 1. In our case we computed the basis in Proposition 3 by first writing down a short basis using Smith’s methodology [45]. We input this into the LLL algorithm [35] to obtain an LLL-reduced basis \(( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_1+\mathbf {b}_4, \mathbf {b}_3)\); these are also the four successive minima under \(\Vert \cdot \Vert _2\). We then input this basis into the algorithm of Lovász and Scarf [38]; this forced the requisite changes to output a basis consisting of the four successive minima under \(\Vert \cdot \Vert _\infty \), namely \(( \mathbf {b}_1, \mathbf {b}_1+\mathbf {b}_4,\mathbf {b}_2,\mathbf {b}_1+\mathbf {b}_3)\). Using this as our input \(\mathbf{B}'\) into the enumeration gave a set S of size 282, which we exhaustively searched to find \(\mathbf{B}\).
We now describe a simple scalar decomposition that uses Babai rounding on the optimal basis above. Note that, since V and r are fixed, the four \(\hat{\alpha }_i\) values below are fixed integer constants.
Proposition 4
For a given integer m, and the basis \(\mathbf{B}:=( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 )\) in Prop. 3, let \((\alpha _1,\alpha _2,\alpha _3,\alpha _4) \in \mathbb {Q}^4\) be the unique solution to \((m,0,0,0) = \sum _{i=1}^{4} \alpha _i \mathbf {b}_i\), and let \((a_1, a_2, a_3,a_4) =(m,0,0,0) - \sum _{i=1}^{4}\lfloor \alpha _i \rceil \cdot \mathbf {b}_i\). Then \(m \equiv a_1+a_2\lambda _\phi + a_3\lambda _\psi + a_4 \lambda _\phi \lambda _\psi \pmod {N}\) and \(|a_1|,|a_2|, |a_3|, |a_4| <2^{62}\).
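The mechanics of Proposition 4 are easiest to see in two dimensions. The following Python sketch applies Babai rounding to a toy GLV-style lattice; the modulus \(N=101\), the eigenvalue \(\lambda =10\) (which satisfies \(\lambda ^2 \equiv -1 \pmod {101}\)) and the basis are assumptions chosen for illustration and are not parameters of Four\(\mathbb {Q}\).

```python
from fractions import Fraction
from math import floor

N = 101                   # toy group order (assumption)
LAMBDA = 10               # toy eigenvalue: 10^2 = -1 (mod 101)
B = [(-10, 1), (1, 10)]   # basis of {(a1, a2) : a1 + a2*LAMBDA = 0 (mod N)}


def round_nearest(q):
    """Round a rational number to the nearest integer (ties rounded up)."""
    return floor(q + Fraction(1, 2))


def decompose(m):
    """Return (a1, a2) = (m, 0) - round(alpha1)*b1 - round(alpha2)*b2."""
    (b11, b12), (b21, b22) = B
    det = b11 * b22 - b12 * b21
    # Solve (m, 0) = alpha1*b1 + alpha2*b2 exactly with Cramer's rule.
    alpha1 = Fraction(m * b22, det)
    alpha2 = Fraction(-m * b12, det)
    c1, c2 = round_nearest(alpha1), round_nearest(alpha2)
    return (m - c1 * b11 - c2 * b21, -c1 * b12 - c2 * b22)
```

For example, `decompose(7)` returns `(-3, 1)`, and indeed \(-3 + 1\cdot 10 = 7\). Because each rounding error is at most 1/2 in absolute value, every coordinate of the output is bounded by half the width of the basis, which is the two-dimensional analogue of the bound stated in the proposition.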
4.2 Handling Round-Off Errors
The decomposition described in Proposition 4 requires the computation of four roundings \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \), where m is the input scalar and the four \(\hat{\alpha }_i\) and N are fixed curve constants. Following [10, Sect. 4.2], one efficient way of performing these roundings is to choose a power of 2 greater than the denominator N, say \(\mu \), and precompute the fixed curve constants \(\ell _i = \lfloor \frac{\hat{\alpha }_i}{N} \cdot \mu \rceil \), so that \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \) can be computed at runtime as \(\lfloor \frac{\ell _i \cdot m }{\mu } \rfloor \), and the division by \(\mu \) can be computed as a simple shift.
It is correctly noted in [10, Sect. 4.2] that computing the rounding in this way means the answer can be out by 1 in some cases, but it is further said that “in practice this does not affect the size of the multiscalars”. While this assertion may have been true in [10], in general this will not be the case, particularly when we wish to bound the size of the multiscalars as tightly as possible. We address this issue on \(\mathcal {E}\) starting with Lemma 1.
Lemma 1
Let \(\hat{\alpha }\) be any integer, and let m, N and \(\mu \) be positive integers with \(m < \mu \). Then \(\left\lfloor \frac{\hat{\alpha } m}{N} \right\rceil - \left\lfloor \left\lfloor \frac{\hat{\alpha } \mu }{N} \right\rceil \cdot \frac{m}{\mu } \right\rfloor \) is either 0 or 1.
Lemma 1 says that, so long as we choose \(\mu \) to be greater than the maximum size of our input scalars m, our fast method of approximating \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \) will either give the correct answer, or it will be \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil -1\). It is easy to see that larger choices of \(\mu \) decrease the probability of a rounding error. For example, on 10 million random decompositions of integers between 0 and N with \(\mu =2^{246}\), roughly 2.2 million trials gave at least one error in the \(\alpha _i\); when \(\mu =2^{247}\), roughly 1.7 million trials gave at least one error; when \(\mu =2^{256}\), 4333 trials gave an error; and, taking \(\mu =2^{269}\) was the first power of two that gave no errors.
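The statement of Lemma 1 can be checked exhaustively for small parameters. The Python sketch below compares the exact rounding with the shift-based approximation; the function names and the toy values \(N=101\) and \(\mu =2^{7}\) used in the accompanying check are assumptions for illustration, and ties in the exact rounding are resolved upwards.

```python
def round_div(a, b):
    """Nearest integer to a/b for b > 0, with ties rounded up."""
    return (2 * a + b) // (2 * b)


def exact_rounding(alpha_hat, m, N):
    """The value round(alpha_hat * m / N) that the decomposition requires."""
    return round_div(alpha_hat * m, N)


def fast_rounding(alpha_hat, m, N, mu_bits):
    """Approximate the same value with one multiplication and one shift."""
    # ell = round(alpha_hat * mu / N) is a fixed constant that can be precomputed.
    ell = round_div(alpha_hat << mu_bits, N)
    # Division by mu = 2^mu_bits is a right shift, i.e. a floor.
    return (ell * m) >> mu_bits


def rounding_gap(alpha_hat, m, N, mu_bits):
    """Difference between the exact and the approximate rounding."""
    return exact_rounding(alpha_hat, m, N) - fast_rounding(alpha_hat, m, N, mu_bits)
```

For every integer `alpha_hat` and every `m` below `2**mu_bits`, `rounding_gap` returns 0 or 1, in agreement with the lemma; the approximation never overshoots the exact rounding.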
Prior works have seemingly addressed this problem by taking \(\mu \) to be large enough so that the chance of round-off errors is very (perhaps even exponentially) small. However, no matter how large \(\mu \) is chosen, the existence of a permissible scalar whose decomposition gives a round-off error is still a possibility^{4}, and this could violate constant-time promises.
In this work, and in light of Theorem 1, we instead choose to sacrifice some speed by guaranteeing that roundoff errors are always accounted for. Rather than assuming that \((a_1,a_2,a_3,a_4)=\sum _{i=1}^4 (\alpha _i - \lfloor \alpha _i \rceil )\mathbf {b}_i\), we account for the approximation \(\tilde{\alpha }_i\) to \(\lfloor \alpha _i \rceil \) (described in Lemma 1) by allowing \((a_1, a_2, a_3,a_4) =\sum _{i=1}^4 \left( \alpha _i - \tilde{\alpha }_i \right) \mathbf {b}_i=\sum _{i=1}^4 \left( \alpha _i - (\lfloor {\alpha }_i\rceil - \epsilon _i) \right) \mathbf {b}_i\), for all sixteen combinations arising from \(\epsilon _i \in \{0,1\}\), for \(i=1,2,3,4\). This means that all integers less than \(\mu \) will decompose to a multiscalar in \(\mathbb {Z}^4\) whose coordinates lie inside the parallelepiped \(\mathcal {P}_\epsilon (\mathbf{B}):=\{\mathbf{B}\mathbf{x} \mid \mathbf{x} \in [-1/2,3/2)^4\}\). Theorem 1 permits scalars as any 256-bit strings, so we fix \(\mu :=2^{256}\) from here on, which also means that division by \(\mu \) will correspond to a shift of machine words. The edges of \(\mathcal {P}_\epsilon (\mathbf{B})\) are twice as long as those of \(\mathcal {P}(\mathbf{B})\), so the number of points in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) is \(\mathrm{vol}(\mathcal {P}_\epsilon ) = 16N\). We note that, even though the number of permissible scalars far exceeds 16N, the decomposition that maps integers in \([0,\mu )\) to multiscalars in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) is certainly no longer onto; almost all of the \(\mu \) scalars will map into \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\), since the chance of roundoff errors that take us into \(\mathcal {P}_\epsilon (\mathbf{B}) \setminus \mathcal {P}(\mathbf{B})\) is small.
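The interval defining \(\mathcal {P}_\epsilon (\mathbf{B})\) and the factor 16 can be checked coordinate by coordinate. The short derivation below is a sketch that assumes ties in \(\lfloor \cdot \rceil \) are rounded up (a convention not stated in this section) and uses \(\mathrm{vol}(\mathcal {P}(\mathbf{B}))=N\):

\[
\begin{aligned}
\alpha _i - \lfloor \alpha _i \rceil &\in [-1/2,\,1/2), \qquad \epsilon _i \in \{0,1\},\\
\alpha _i - \tilde{\alpha }_i &= (\alpha _i - \lfloor \alpha _i \rceil ) + \epsilon _i \in [-1/2,\,3/2),\\
\mathrm{vol}(\mathcal {P}_\epsilon (\mathbf{B})) &= 2^4 \cdot \mathrm{vol}(\mathcal {P}(\mathbf{B})) = 16N.
\end{aligned}
\]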
Plainly, the width of the smallest 4-cube containing \(\mathcal {P}_\epsilon (\mathbf{B})\) is also twice that of the 4-cube containing \(\mathcal {P}(\mathbf{B})\), so (in the sense of Definition 1) our basis is still Babai-optimal. Nevertheless, the bounds in Proposition 4 no longer apply, which is one of the issues addressed in the next subsection.
4.3 All-Positive Multiscalars
Many points in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) have coordinates that are far greater than \(2^{62}\) in absolute value, and in addition, the majority of them will have coordinates that are both positive and negative. Dealing with such signed multiscalars can require an additional iteration in the main loop of the scalar multiplication, so in this subsection we use an offset vector in \(\mathcal {L}\) to find a translate of \(\mathcal {P}_\epsilon (\mathbf{B})\) that contains points whose four coordinates are always positive. We note that this does not save the additional iteration mentioned above, but (at no cost) it does simplify the scalar recoding, such that we do not have to deal with multiscalars that can have negative coordinates. Such offset vectors were used in two dimensions in [13, Sect. 4].
From the proof of Proposition 3, we have that the width of the smallest 4-cube containing \(\mathcal {P}_\epsilon (\mathbf{B})\) is \(2\cdot (245\alpha +120r+17)/448\), which lies between \(2^{63}\) and \(2^{64}\). Thus, the optimal situation is to find a translate of \(\mathcal {P}_\epsilon (\mathbf{B})\) (using a vector in \(\mathcal {L}\)) that fits inside the convex body of the 4-cube \(\mathcal {H} = \{2^{64}\cdot \mathbf{x} \mid \mathbf{x} \in [0,1]^4\}\). In fact, as we discuss in the next paragraph, we actually want to find two distinct translates of \(\mathcal {P}_\epsilon (\mathbf{B})\) inside \(\mathcal {H}\).
The scalar recoding described in Sect. 5 requires that the first component of the multiscalar \((a_1,a_2,a_3,a_4)\) is odd. In the case that \(a_1\) is even, which happens around half of the time, previous works have employed this “odd-only” recoding by instead working with the multiscalar \((a_1-1,a_2,a_3,a_4)\), and adding the point P to the value output by the main loop (cf. [41, Algorithm 4] and [18, Algorithm 2]). Of course, in a constant-time routine, this scalar update and point addition must be performed regardless of the parity of \(a_1\), and the correct scalars and results must be masked in and out of the main loop accordingly. In this work we simplify the situation by using offset vectors in \(\mathcal {L}\) to achieve the same result; this has the added advantage of avoiding an extra point addition. We do this by finding two vectors \(\mathbf{c}, \mathbf{c}' \in \mathcal {L}\) such that \(\mathbf{c}+\mathcal {P}_\epsilon (\mathbf{B})\) and \(\mathbf{c}' +\mathcal {P}_\epsilon (\mathbf{B})\) both lie inside \(\mathcal {H}\), and such that precisely one of \((a_1,a_2,a_3,a_4)+\mathbf{c}\) and \((a_1,a_2,a_3,a_4)+\mathbf{c}'\) has a first component that is odd. This is made explicit in the full scalar decomposition described below.
Proposition 5
(Scalar Decompositions). Let \(\mathbf{B}=(\mathbf {b}_1,\mathbf {b}_2,\mathbf {b}_3,\mathbf {b}_4)\) be the basis in Proposition 3, let \(\mu =2^{256}\), and define the four curve constants \(\ell _i:=\lfloor \hat{\alpha }_i \cdot \mu /N \rceil \) for \(i=1,2,3,4\), with the \(\hat{\alpha }_i\) as given in Proposition 4. Let \(\mathbf{c}=2\mathbf {b}_1-\mathbf {b}_2+5\mathbf {b}_3+2\mathbf {b}_4\) and \(\mathbf{c}'= 2\mathbf {b}_1-\mathbf {b}_2+5\mathbf {b}_3+\mathbf {b}_4\) in \(\mathcal {L}\). For any integer \(m \in [0,2^{256})\), let \(\tilde{\alpha }_i =\left\lfloor \ell _i m /\mu \right\rfloor \), and let \((a_1, a_2, a_3,a_4) =(m,0,0,0) - \sum _{i=1}^{4}\lfloor \tilde{\alpha }_i \rceil \cdot \mathbf {b}_i\). Then, both of the multiscalars \((a_1,a_2,a_3,a_4)+\mathbf{c}\) and \((a_1,a_2,a_3,a_4)+\mathbf{c}'\) are valid decompositions of m, have all four coordinates positive and less than \(2^{64}\), and precisely one of them has a first coordinate that is odd.
The scalar decomposition described in Proposition 5 outputs two multiscalars. Our decomposition routine uses a bitmask to select and output the one with an odd first coordinate in constant time.
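The bitmask selection can be illustrated as follows. This is only a sketch of the masking logic: the function name `select_odd` and the sample values are hypothetical, 64-bit words are emulated with Python integers, and Python itself gives no constant-time guarantees.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit machine words

def select_odd(u, v):
    """Return whichever of u, v has an odd first coordinate, without branching on it.

    Assumes exactly one of u[0], v[0] is odd and all coordinates fit in 64 bits.
    """
    mask = (-(u[0] & 1)) & MASK64   # all ones if u[0] is odd, else zero
    inv = mask ^ MASK64             # complementary mask
    return tuple((x & mask) | (y & inv) for x, y in zip(u, v))

# Hypothetical candidates standing in for (a + c) and (a + c').
print(select_odd((5, 2, 3, 4), (8, 2, 3, 9)))   # first candidate is the odd one
print(select_odd((6, 1, 1, 1), (9, 7, 7, 7)))   # second candidate is the odd one
```

Both candidates are always computed and the same bitwise operations are applied in either case, which is what allows a fixed instruction sequence in a low-level implementation.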
5 The Scalar Multiplication
This section describes the full scalar multiplication of \(P \in \mathcal {E}(\mathbb {F}_{p^2})\) by an integer \(m \in [0,2^{256})\), pulling together the endomorphisms and scalar decompositions derived in the previous two sections.
5.1 Recoding the Multiscalar
The “all-positive” multiscalar \((a_1,a_2,a_3,a_4)\) that is obtained from the decomposition described in Proposition 5 could be fed as is into a simple 4-way multiexponentiation (e.g., the 4-dimensional version of [48]) to achieve an efficient scalar multiplication. However, more care needs to be taken to obtain an efficient routine that also runs in constant time. For example, we need to guarantee that the main loop always executes the same number of iterations, which would not currently be the case since \(\mathrm{max}_j(\mathrm{log}_2(a_j))\) can be several integers less than 64. As another example, a straightforward multiexponentiation could leak information in the case that the ith bit of all four \(a_j\) values was 0, which would result in a “do-nothing” rather than a nontrivial addition.
To achieve an efficient constant-time routine, we adopt the general recoding algorithm from [18, Algorithm 1], and tailor it to scalar multiplications on Four\(\mathbb {Q}\). This results in Algorithm 1 below, which is presented in two flavors: one that is geared towards the general reader and one that is geared towards implementers (we note that the lines do not coincide for the most part). On input of any multiscalar \((a_1,a_2,a_3,a_4)\) produced by Proposition 5, Algorithm 1 outputs an equivalent multiscalar \((b_1,b_2,b_3,b_4)\) with \(b_j = \sum _{i=0}^{64}b_j[i] \cdot 2^i\) for \(b_j[i]\in \{-1,0,1\}\) and \(j=1,2,3,4\), such that we always have \(b_1[64]=1\) and such that \(b_1[i]\) is nonzero for every \(i=0,\dots ,63\). This fixes the length of the main loop and ensures that each addition step of the multiexponentiation requires an addition by something other than the neutral element.
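A recoding with the stated properties can be sketched as follows. This is our reading of a sign-aligned recoding in the spirit of [18, Algorithm 1], written for clarity rather than constant-time execution; it is not a transcription of Algorithm 1, and the names `recode` and `value` are ours.

```python
def recode(a, length=64):
    """Sign-aligned recoding of a multiscalar a = (a1, a2, a3, a4) with a1 odd.

    Returns digit lists b[j][0..length] with b[0][i] in {-1, 1} for i < length,
    b[0][length] == 1, and b[j][i] in {0, b[0][i]} for j >= 1 and i < length.
    """
    a1, rest = a[0], list(a[1:])
    assert a1 & 1 == 1 and all(0 <= x < (1 << length) for x in a)
    b = [[0] * (length + 1) for _ in a]
    b[0][length] = 1
    for i in range(length):
        # Sign digit of b1 taken from the next-higher bit of a1.
        b[0][i] = 2 * ((a1 >> (i + 1)) & 1) - 1
        for j, aj in enumerate(rest, start=1):
            d = b[0][i] * (aj & 1)          # 0, or the same sign as b1's digit
            b[j][i] = d
            # d >> 1 is floor(d / 2), i.e. -1 only when d == -1.
            rest[j - 1] = (aj >> 1) - (d >> 1)
    for j, aj in enumerate(rest, start=1):
        b[j][length] = aj                   # leftover top digit, 0 or 1
    return b

def value(digits):
    """Integer represented by a little-endian signed-digit list."""
    return sum(d << i for i, d in enumerate(digits))

a = (0x1234567890ABCDEF, 0xFFFFFFFFFFFFFFFF, 0, 12345)
b = recode(a)
print([value(d) == x for d, x in zip(b, a)])  # each b_j represents a_j
```

Because every digit of \(b_2,b_3,b_4\) is either zero or shares the sign of the corresponding digit of \(b_1\), each column of digits selects one table entry up to sign, so every iteration performs a genuine addition.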
5.2 The Full Routine
Theorem 1
For every point \(P \in \mathcal {E}(\mathbb {F}_{p^2})[N]\) and every nonnegative integer m less than \(2^{256}\), Algorithm 2 computes [m]P correctly using a fixed sequence of field, integer and table-lookup operations.
6 Performance Analysis and Results
This section shows that, at the 128-bit security level, Four\(\mathbb {Q}\) is significantly faster than all other known curve-based primitives. We reiterate that our software runs in constant time and is therefore fully protected against timing and cache attacks.
6.1 Operation Counts
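Table 2. Operation counts for variable-base scalar multiplications on three different curves targeting the 128-bit security level. The first four count columns are operation counts over \(\mathbb {F}_{p^2}\); the last four are approximate operation counts over \(\mathbb {F}_p\). In the case of the Kummer surface, we additionally use a “wordmul” column to count the number of special multiplications of a general element in \(\mathbb {F}_p\) by a small (i.e., one-word) constant (see [6]).

| primitive | prime char. p | inv (\(\mathbb {F}_{p^2}\)) | mul (\(\mathbb {F}_{p^2}\)) | sqr (\(\mathbb {F}_{p^2}\)) | add (\(\mathbb {F}_{p^2}\)) | inv (\(\mathbb {F}_p\)) | mul (\(\mathbb {F}_p\)) | add (\(\mathbb {F}_p\)) | wordmul (\(\mathbb {F}_p\)) |
|---|---|---|---|---|---|---|---|---|---|
| Four\(\mathbb {Q}\) | \(2^{127}-1\) | 1 | 842 | 283 | 950.5 | 1 | 3092 | 6960 | - |
| GLV+GLS | \(2^{127}-5997\) | 1 | 833 | 191 | 769 | 1 | 2885 | 6278 | - |
| Kummer | \(2^{127}-1\) | - | - | - | - | 1 | 4319 | 8032 | 2008 |
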
Table 2 shows that the GLV+GLS routine from [37] requires slightly fewer operations than Four\(\mathbb {Q}\). This can mainly be explained by its faster endomorphisms, but (as we will see in Table 3) this difference is more than made up for by the faster modular arithmetic and superior simplicity of Four\(\mathbb {Q}\). Table 2 also shows that Four\(\mathbb {Q}\) requires far fewer operations (in the same ground field) than Kummer; it is therefore expected, in general, that implementations based on Four\(\mathbb {Q}\) outperform Kummer implementations for computing variable-base scalar multiplications.
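The relation between the \(\mathbb {F}_{p^2}\) and \(\mathbb {F}_p\) columns of Table 2 can be illustrated with a small calculation. The conversion weights below are an assumption on our part (Karatsuba-style multiplication at 3 mul + 5 add, squaring at 2 mul + 3 add, addition at 2 add, inversion ignored); they are not stated in this section, although they do reproduce the Four\(\mathbb {Q}\) row.

```python
def fp_counts(mul2, sqr2, add2):
    """Approximate F_p (mul, add) counts from F_{p^2} (mul, sqr, add) counts.

    Assumed weights (hypothetical, see lead-in):
      1 multiplication in F_{p^2} = 3 mul + 5 add in F_p
      1 squaring in F_{p^2}       = 2 mul + 3 add in F_p
      1 addition in F_{p^2}       = 2 add in F_p
    """
    mul = 3 * mul2 + 2 * sqr2
    add = 5 * mul2 + 3 * sqr2 + 2 * add2
    return mul, add

# FourQ row of Table 2: 842 mul, 283 sqr, 950.5 add over F_{p^2}.
print(fp_counts(842, 283, 950.5))
```

With the same weights the GLV+GLS row comes out close to, but not exactly at, the tabulated values, which is consistent with the \(\mathbb {F}_p\) columns being labeled as approximate.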
6.2 Experimental Results
To evaluate performance, we wrote a standalone library supporting Four\(\mathbb {Q}\) (see [15]). The library’s design pursues modularity and code reuse, and leverages the simplicity of Four\(\mathbb {Q}\)’s arithmetic. It also facilitates the addition of specialized code for different platforms and applications: the core functionality of the library is fully written in portable C and works together with pluggable implementations of the arithmetic over \(\mathbb {F}_{p^2}\) (and a few other complementary functions). The first release version of the library comes with two of those pluggable modules: a portable implementation written in C and a high-performance implementation for x64 platforms written in C and optional x64 assembly. The library computes all of the basic elliptic curve operations including variable-base and fixed-base scalar multiplications, making it suitable for a wide range of cryptographic protocols. In addition, the software permits the selection (at build time) of whether or not the endomorphisms \(\psi \) and \(\phi \) are to be exploited in variable-base scalar multiplications.
Table 3. Performance results (expressed in terms of thousands of clock cycles) of state-of-the-art implementations of various curves targeting the 128-bit security level on various x64 platforms. Benchmark tests were taken with Intel’s TurboBoost and AMD’s TurboCore disabled and the results were rounded to the nearest 1000 clock cycles. The benchmarks for the Four\(\mathbb {Q}\) and GLV+GLS implementations were done on 1.66 GHz Intel Atom N570 Pineview, 3.4 GHz Intel Core i7-2600 Sandy Bridge, 3.4 GHz Intel Core i7-3770 Ivy Bridge, 3.4 GHz Intel Core i7-4770 Haswell and 3.1 GHz AMD A8 PRO-7600B Kaveri. For the Kummer implementations [6, 9] and Curve25519 implementation [3], Pineview, Sandy Bridge, Ivy Bridge and Haswell benchmarks were taken from eBACS [7] (machines h2atom, h6sandy, h9ivy and titan0), while AMD benchmarks were obtained by running eBACS’ SUPERCOP toolkit on the corresponding targeted machine. The benchmarks for curve NIST P-256 were taken directly from [26] and the second set of Curve25519 benchmarks were taken directly from [12].

| Processor | Operation | Four\(\mathbb {Q}\) (this work) | GLV+GLS [18] | Kummer [9] | Kummer [6] | Curve25519 [3] | Curve25519 [12] | P-256 [26] |
|---|---|---|---|---|---|---|---|---|
| Atom Pineview | var-base | 442 | N/A | 556 | N/A | 1,109 | N/A | N/A |
| | fixed-base | 217 | N/A | - | N/A | - | N/A | N/A |
| | ephem. DH | 659 | N/A | 1,112 | N/A | 2,218 | N/A | N/A |
| Sandy Bridge | var-base | 74 | 92 | 123 | 89 | 194 | 157 | 400 |
| | fixed-base | 42 | 51 | - | - | - | 54 | 90 |
| | ephem. DH | 116 | 143 | 246 | 178 | 388 | 211 | 490 |
| Ivy Bridge | var-base | 71 | 89 | 119 | 88 | 183 | 159 | N/A |
| | fixed-base | 39 | 49 | - | - | - | 52 | N/A |
| | ephem. DH | 110 | 138 | 238 | 176 | 366 | 211 | N/A |
| Haswell | var-base | 59 | N/A | 111 | 61 | 162 | N/A | 312 |
| | fixed-base | 33 | N/A | - | - | - | N/A | 67 |
| | ephem. DH | 92 | N/A | 222 | 122 | 324 | N/A | 379 |
| AMD Kaveri | var-base | 122 | N/A | 151 | 164 | 301 | N/A | N/A |
| | fixed-base | 65 | N/A | - | - | - | N/A | N/A |
| | ephem. DH | 187 | N/A | 302 | 328 | 602 | N/A | N/A |

Table 3 shows that, in comparison with the “conservative” curves, Four\(\mathbb {Q}\) is 2.1–2.7 times faster than the Curve25519 implementations in [3, 12] and up to 5.4 times faster than the curve P-256 implementation in [26], when computing variable-base scalar multiplications. When considering the results for the DH key exchange, Four\(\mathbb {Q}\) performs 1.8–3.5 times faster than Curve25519 and up to 4.2 times faster than curve P-256.
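The quoted speedup range over Curve25519 can be recomputed directly from the variable-base rows of Table 3. The snippet below is a small sanity check; the dictionary names are ours and the figures are the rounded values from the table (in thousands of cycles).

```python
# Variable-base timings in thousands of cycles, copied from Table 3.
FOURQ = {
    "Atom Pineview": 442,
    "Sandy Bridge": 74,
    "Ivy Bridge": 71,
    "Haswell": 59,
    "AMD Kaveri": 122,
}
# Curve25519 results from [3] and, where available, [12].
CURVE25519 = {
    "Atom Pineview": [1109],
    "Sandy Bridge": [194, 157],
    "Ivy Bridge": [183, 159],
    "Haswell": [162],
    "AMD Kaveri": [301],
}

ratios = [t / FOURQ[cpu] for cpu, ts in CURVE25519.items() for t in ts]
slowest, fastest = min(ratios), max(ratios)
print(f"speedup range: {slowest:.2f}x to {fastest:.2f}x")
```

The smallest ratio comes from the Sandy Bridge result of [12] and the largest from the Haswell result of [3].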
In terms of comparisons to the previously fastest implementations, variable-base scalar multiplications using our software are between 1.20 and 1.34 times faster than the Kummer [6, 9] and the GLV+GLS [18] implementations on AMD’s Kaveri and Intel’s Atom Pineview, Sandy Bridge and Ivy Bridge. The Kummer implementation for Haswell in [6] is particularly fast because it takes advantage of the powerful AVX2 vector instructions. Nevertheless, our implementation (which does not currently exploit vector instructions to accelerate the field arithmetic) is still faster in the case of variable-base scalar multiplication. Moreover, in practice we expect a much larger advantage. For example, in the case of the DH key exchange, we leverage the efficiency of fixed-base scalar multiplications to achieve a factor 1.33x speedup over the Kummer implementation on Haswell. For the remaining platforms considered in Table 3, a DH shared secret using the Four\(\mathbb {Q}\) software can be computed 1.5–1.8 times faster than a DH secret using the Kummer software in [6]. We note that the eBACS website [7] and [6] report different results for the same Kummer software on the same platform (i.e., Titan0): eBACS reports 60,556 Haswell cycles whereas [6] claims 54,389 Haswell cycles. This difference in performance raises questions regarding accuracy. The results that we obtained after running the eBACS’ SUPERCOP toolkit on our own targeted Haswell machine seem to confirm that the results claimed in [6] for the Kummer were measured with TurboBoost enabled.
Four\(\mathbb {Q}\) without endomorphisms. Our library can be built with a version of the variable-base scalar multiplication function that does not exploit the endomorphisms \(\psi \) and \(\phi \) to accelerate computations (note that fixed-base scalar multiplications do not exploit these endomorphisms by default). In this case, Four\(\mathbb {Q}\) computes one variable-base scalar multiplication in (respectively) 109, 131, 138 and 803 thousand cycles on the Haswell, Ivy Bridge, Sandy Bridge and Atom Pineview processors used for our experiments. These results are up to 2.9 times faster than the corresponding results for NIST P-256 and up to 1.5 times faster than the corresponding results for Curve25519.
Footnotes
1. p stands alone as the only Mersenne prime suitable for high-security curves over quadratic extension fields. The next largest Mersenne prime is \(2^{521}-1\), which is suitable only for prime field curves targeting the 256-bit level.
2. Here, and throughout, \(\mathbf{I}\), \(\mathbf{M}\), \(\mathbf{S}\) and \(\mathbf{A}\) are used to denote the respective costs of inversions, multiplications, squarings and additions in \(\mathbb {F}_{p^2}\). We note that Frobenius operations amount to conjugations in \(\mathbb {F}_p\), which are tallied as \(0.5\mathbf{A}\).
3. This is a translate (by \(\frac{1}{2}(\sum _{i=1}^4 \mathbf {b}_i)\)) of the fundamental parallelepiped, which is defined using \(\mathbf{x} \in [0,1)^4\).
4. This is not technically true: so long as the set of permissible scalars is finite, there will always be a \(\mu \) large enough to round all scalar decompositions accurately, but finding or proving this is, to our knowledge, very difficult.
Notes
Acknowledgements
We thank Michael Naehrig for several discussions throughout this work, and Joppe Bos, Sorina Ionica and Greg Zaverucha for their comments on an earlier version of this paper. We are especially thankful to Ben Smith for pointing out the better option for \(\phi \) in Sect. 3.2.
References
1. Ahmadi, O., Granger, R.: On isogeny classes of Edwards curves over finite fields. Cryptology ePrint Archive, Report 2011/135 (2011). http://eprint.iacr.org/
2. Babai, L.: On Lovász’ lattice reduction and the nearest lattice point problem. Combinatorica 6(1), 1–13 (1986)
3. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.-Y.: High-speed high-security signatures. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 124–142. Springer, Heidelberg (2011)
4. Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 207–228. Springer, Heidelberg (2006)
5. Bernstein, D.J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards curves. In: Vaudenay, S. (ed.) AFRICACRYPT 2008. LNCS, vol. 5023, pp. 389–405. Springer, Heidelberg (2008)
6. Bernstein, D.J., Chuengsatiansup, C., Lange, T., Schwabe, P.: Kummer strikes back: new DH speed records. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 317–337. Springer, Heidelberg (2014)
7. Bernstein, D.J., Lange, T.: eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://bench.cr.yp.to/results-dh.html. Accessed on May 19, 2015
8. Bernstein, D.J., Lange, T.: Hyper-and-elliptic-curve cryptography. LMS J. Comput. Math. 17(A), 181–202 (2014)
9. Bos, J.W., Costello, C., Hisil, H., Lauter, K.: Fast cryptography in genus 2. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp. 194–210. Springer, Heidelberg (2013)
10. Bos, J.W., Costello, C., Hisil, H., Lauter, K.: High-performance scalar multiplication using 8-dimensional GLV/GLS decomposition. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 331–348. Springer, Heidelberg (2013)
11. Certicom Research. Standards for Efficient Cryptography 2: Recommended Elliptic Curve Domain Parameters, v2.0. Standard SEC2, Certicom (2010)
12. Chou, T.: Fastest Curve25519 implementation ever. In: Workshop on Elliptic Curve Cryptography Standards (2015). http://www.nist.gov/itl/csd/ct/eccworkshop.cfm
13. Costello, C., Hisil, H., Smith, B.: Faster compact Diffie–Hellman: endomorphisms on the x-line. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol. 8441, pp. 183–200. Springer, Heidelberg (2014)
14. Costello, C., Longa, P.: Four\(\mathbb{Q}\): four-dimensional decompositions on a \(\mathbb{Q}\)-curve over the Mersenne prime (extended version). Cryptology ePrint Archive, Report 2015/565 (2015). http://eprint.iacr.org/
15. Costello, C., Longa, P.: Four\(\mathbb{Q}\)lib (2015). http://research.microsoft.com/fourqlib/
16. Duursma, I.M., Gaudry, P., Morain, F.: Speeding up the discrete log computation on curves with automorphisms. In: Lam, K.-Y., Okamoto, E., Xing, C. (eds.) ASIACRYPT 1999. LNCS, vol. 1716, pp. 103–121. Springer, Heidelberg (1999)
17. Edwards, H.: A normal form for elliptic curves. Bull. Am. Math. Soc. 44(3), 393–422 (2007)
18. Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves (extended version). J. Cryptographic Eng. 5(1), 31–52 (2015)
19. Frey, G., Müller, M., Rück, H.: The Tate pairing and the discrete logarithm applied to elliptic curve cryptosystems. IEEE Trans. Inf. Theor. 45(5), 1717–1719 (1999)
20. Galbraith, S.D.: Mathematics of Public Key Cryptography. Cambridge University Press, Cambridge (2012)
21. Galbraith, S.D., Lin, X., Scott, M.: Endomorphisms for faster elliptic curve cryptography on a large class of curves. J. Cryptology 24(3), 446–469 (2011)
22. Gallant, R.P., Lambert, R.J., Vanstone, S.A.: Faster point multiplication on elliptic curves with efficient endomorphisms. In: Kilian, J. (ed.) CRYPTO 2001. LNCS, vol. 2139, pp. 190–200. Springer, Heidelberg (2001)
23. Gaudry, P.: Fast genus 2 arithmetic based on Theta functions. J. Math. Cryptology 1(3), 243–265 (2007)
24. Gaudry, P.: Index calculus for abelian varieties of small dimension and the elliptic curve discrete logarithm problem. J. Symbolic Comput. 44(12), 1690–1702 (2009)
25. Gaudry, P., Schost, E.: Genus 2 point counting over prime fields. J. Symbolic Comput. 47(4), 368–400 (2012)
26. Gueron, S., Krasnov, V.: Fast prime field elliptic curve cryptography with 256 bit primes. J. Cryptographic Eng. 5(2), 141–151 (2015)
27. Guillevic, A., Ionica, S.: Four-dimensional GLV via the Weil restriction. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 79–96. Springer, Heidelberg (2013)
28. Hamburg, M.: Twisting Edwards curves with isogenies. Cryptology ePrint Archive, Report 2014/027 (2014). http://eprint.iacr.org/
29. Hasegawa, Y.: \(\mathbb{Q}\)-curves over quadratic fields. Manuscripta Math. 94(1), 347–364 (1997)
30. Hisil, H., Costello, C.: Jacobian coordinates on genus 2 curves. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 338–357. Springer, Heidelberg (2014)
31. Hisil, H., Wong, K.K.-H., Carter, G., Dawson, E.: Twisted Edwards curves revisited. In: Pieprzyk, J. (ed.) ASIACRYPT 2008. LNCS, vol. 5350, pp. 326–343. Springer, Heidelberg (2008)
32. Hu, Z., Longa, P., Xu, M.: Implementing 4-dimensional GLV method on GLS elliptic curves with j-invariant 0. Des. Codes Cryptography 63(3), 331–343 (2012)
33. Koblitz, A.H., Koblitz, N., Menezes, A.: Elliptic curve cryptography: the serpentine course of a paradigm shift. J. Number Theor. 131(5), 781–814 (2011)
34. Kohel, D.: Endomorphism rings of elliptic curves over finite fields. Ph.D. thesis, University of California at Berkeley (1996)
35. Lenstra, A.K., Lenstra, H.W., Lovász, L.: Factoring polynomials with rational coefficients. Math. Ann. 261(4), 515–534 (1982)
36. Longa, P., Gebotys, C.: Efficient techniques for high-speed elliptic curve cryptography. In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 80–94. Springer, Heidelberg (2010)
37. Longa, P., Sica, F.: Four-dimensional Gallant-Lambert-Vanstone scalar multiplication. J. Cryptology 27(2), 248–283 (2014)
38. Lovász, L., Scarf, H.E.: The generalized basis reduction algorithm. Math. Oper. Res. 17(3), 751–764 (1992)
39. Menezes, A., Vanstone, S.A., Okamoto, T.: Reducing elliptic curve logarithms to logarithms in a finite field. In: Koutsougeras, C., Vitter, J.S. (eds.) Proceedings of 23rd Annual ACM Symposium on Theory of Computing, pp. 80–89. ACM (1991)
40. National Institute of Standards and Technology (NIST). 186-2. Digital Signature Standard (DSS). Federal Information Processing Standards (FIPS) Publication (2000)
41. Oliveira, T., López, J., Aranha, D.F., Rodríguez-Henríquez, F.: Two is the fastest prime: lambda coordinates for binary elliptic curves. J. Cryptographic Eng. 4(1), 3–17 (2014)
42. Pollard, J.M.: Monte Carlo methods for index computation (mod \(p\)). Math. Comput. 32(143), 918–924 (1978)
43. Scholten, J.: Weil restriction of an elliptic curve over a quadratic extension (2004). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.7987&rep=rep1&type=pdf
44. Smith, B.: Families of fast elliptic curves from \(\mathbb{Q}\)-curves. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 61–78. Springer, Heidelberg (2013)
45. Smith, B.: Easy scalar decompositions for efficient scalar multiplication on elliptic curves and genus 2 Jacobians. In: Contemporary Mathematics Series, vol. 637, p. 15. American Mathematical Society (2015)
46. Smith, B.: The \(\mathbb{Q}\)-curve construction for endomorphism-accelerated elliptic curves. J. Cryptology (2015, to appear)
47. Stark, H.M.: Class-numbers of complex quadratic fields. In: Kuijk, W. (ed.) Modular Functions of One Variable I, pp. 153–174. Springer, Heidelberg (1973)
48. Straus, E.G.: Addition chains of vectors. Am. Math. Mon. 70(806–808), 16 (1964)
49. Vélu, J.: Isogénies entre courbes elliptiques. CR Acad. Sci. Paris Sér. A-B 273, A238–A241 (1971)
50. Wiener, M., Zuccherato, R.J.: Faster attacks on elliptic curve cryptosystems. In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 190–200. Springer, Heidelberg (1999)