
1 Introduction

Bilinear maps, or pairings, between the (divisors on the) groups of points of certain algebraic curves over a finite field, particularly the Weil pairing [94] and the Tate (or Tate-Lichtenbaum) pairing [45], were first introduced in cryptology for destructive cryptanalytic purposes, namely, mapping the discrete logarithm problem on those groups to the discrete logarithm problem on the multiplicative group of a certain extension of the base field [46, 66]. While the best generic classical (non-quantum) algorithm for the discrete logarithm problem on the former groups may be exponential, subexponential algorithms are known in the latter case, so that such a mapping may yield a problem that is asymptotically easier to solve.

It turned out, perhaps surprisingly, that these same tools have a much more relevant role in a constructive cryptographic context, as the basis for the definition of cryptosystems with unique properties. This has been shown in the seminal works on identity-based non-interactive authenticated key agreement by Sakai, Ohgishi and Kasahara [84], and on one-round tripartite key agreement by Joux [56], which then led to an explosion of protocols exploring the possibilities of identity-based cryptography and many other schemes, with ever more complex features.

All this flexibility comes at a price: pairings are notoriously expensive in implementation complexity and processing time (and/or storage occupation, in a trade-off between time and space requirements). This imposes a very careful choice of algorithms and curves to make them really practical. The pioneering approach by Miller [67, 68] showed that pairings could be computed in polynomial time, but there is a large gap from there to a truly efficient implementation approach.

Indeed, progress in this line of research has not only revealed theoretical bounds on how efficiently a pairing can be computed in the sense of its overall order of complexity [93]; the literature now also offers very detailed approaches to attaining truly practical, extremely optimized implementations that cover all operations typically found in a pairing-based cryptosystem, rather than just the pairing itself [4, 80]. One can therefore reasonably ask how far this trend can be pushed, and how “notoriously expensive” pairings really are (or even whether they really are as expensive as the folklore pictures them).

Our Contribution. In this paper we review the evolution of pairing-based cryptosystems, the development of efficient algorithms for the computation of pairings and the state of the art in the area, and the challenges yet to be addressed on the subject.

Furthermore, we provide some new refinements to the pairing computation in affine and projective coordinates over ordinary curves, perform an up-to-date analysis of the best algorithms for the realization of pairings with special focus on the 128-bit security level and present a very efficient implementation for x64 platforms.

Organization. The remainder of this paper is organized as follows. Section 2 introduces essential notions on elliptic curves and bilinear maps for cryptographic applications, including some of the main pairing-based cryptographic protocols and their underlying security assumptions. Section 3 reviews the main proposals for pairing-friendly curves and the fundamental algorithms for their construction and manipulation. In Sect. 4, we describe some optimizations to formulas in affine and projective coordinates, carry out a performance analysis of the best available algorithms and discuss benchmarking results of our high-speed implementation targeting the 128-bit security level on various x64 platforms. We conclude in Sect. 5.

2 Preliminary Concepts

Let \(q = p^m\). An elliptic curve \(E/\mathbb {F}_q\) is a smooth projective algebraic curve of genus one with at least one point. The affine part satisfies an equation of the form \(E: y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6\) where \(a_i \in \mathbb {F}_q\). Points on \(E\) are affine points \((x, y) \in \mathbb {F}_q^2\) satisfying the curve equation, together with an additional point at infinity, denoted \(\infty \). The set of curve points whose coordinates lie in a particular extension field \(\mathbb {F}_{q^k}\) is denoted \(E(\mathbb {F}_{q^k})\) for \(k > 0\) (note that the \(a_i\) remain in \(\mathbb {F}_q\)). Let \(\#E(\mathbb {F}_q)=n\) and write \(n\) as \(n=q+1-t\); \(t\) is called the trace of the Frobenius endomorphism. By Hasse’s theorem, \(|t| \leqslant 2\sqrt{q}\).

An (additive) Abelian group structure is defined on \(E\) by the well-known chord-and-tangent method [91]. The order of a point \(P \in E\) is the least positive integer \(r\) such that \([r]P = \infty \), where \([r]P\) is the sum of \(r\) terms equal to \(P\). The order \(r\) of a point divides the curve order \(n\). For a given integer \(r\), the set of all points \(P \in E\) such that \([r]P = \infty \) is denoted \(E[r]\). We say that \(E[r]\) has embedding degree \(k\) if \(r \;|\; q^k - 1\) and \(r \not \mid q^s - 1\) for any \(0 < s < k\).
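The definitions above can be made concrete on a toy example. The following is a minimal sketch, assuming the small curve \(y^2 = x^3 + x\) over \(\mathbb {F}_{59}\) (an illustrative choice that does not come from the text); the helper names `add`, `order` and `embedding_degree` are hypothetical. It counts the points by brute force, derives the trace, checks the Hasse bound, and computes point orders and the embedding degree of the subgroup of order 5.

```python
# Toy curve y^2 = x^3 + x over F_59 (illustrative parameters, not taken from the text).
p, a, b = 59, 1, 0

def add(P, Q):
    """Chord-and-tangent addition; None stands for the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                              # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p     # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def order(P):
    """Least positive r with [r]P = infinity, found by repeated addition."""
    r, R = 1, P
    while R is not None:
        R, r = add(R, P), r + 1
    return r

def embedding_degree(r, q):
    """Least k > 0 with r | q^k - 1 (assumes gcd(r, q) = 1)."""
    k = 1
    while pow(q, k, r) != 1:
        k += 1
    return k

points = [(x, y) for x in range(p) for y in range(p)
          if (y * y - (x**3 + a * x + b)) % p == 0]
n = len(points) + 1              # affine points plus the point at infinity
t = p + 1 - n                    # trace of the Frobenius endomorphism
print(n, t, t * t <= 4 * p)      # curve order, trace, Hasse bound check
print(sorted({order(P) for P in points}))
print(embedding_degree(5, p))
```

Brute-force counting is only feasible for toy sizes; it is used here purely to keep the sketch self-contained.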

The complex multiplication (CM) method [37] constructs an elliptic curve with a given number of points \(n\) over a given finite field \(\mathbb {F}_q\) as long as \(n = q + 1 - t\) as required by the Hasse bound, and the norm equation \(DV^2 = 4q - t^2\) can be solved for “small” values of the discriminant \(D\). From \(D\) one computes the \(j\)-invariant of the curve (which is a function of the coefficients of the curve equation). The curve equation is finally given by \(y^2 = x^3 + b\) (for certain values of \(b\)) when \(j = 0\), by \(y^2 = x^3 + ax\) (for certain values of \(a\)) when \(j = 1728\), and by \(y^2 = x^3 - 3cx + 2c\) with \(c := j/(j - 1728)\) when \(j \not \in \{0, 1728\}\).

A divisor is a finite formal sum \(\mathcal {A} = \sum _P{a_P(P)}\) of points on the curve \(E(\mathbb {F}_{q^k})\). An Abelian group structure is defined on the set of divisors by the addition of corresponding coefficients in their formal sums; in particular, \(n\mathcal {A} = \sum _P{(n \, a_P)(P)}\). The degree of a divisor \(\mathcal {A}\) is the sum \(\deg (\mathcal {A}) = \sum _P{a_P}\). Let \(f: E(\mathbb {F}_{q^k}) \rightarrow \mathbb {F}_{q^k}\) be a function on the curve. We define \(f(\mathcal {A}) \equiv \prod _P{f(P)^{a_P}}\). Let \({{\mathrm{ord}}}_P(f)\) denote the multiplicity of the zero or pole of \(f\) at \(P\) (if \(f\) has no zero or pole at \(P\), then \({{\mathrm{ord}}}_P(f) = 0\)). The divisor of \(f\) is \((f) := \sum _P{{{\mathrm{ord}}}_P(f)(P)}\). A divisor \(\mathcal {A}\) is called principal if \(\mathcal {A} = (f)\) for some function \(f\). A divisor \(\mathcal {A}\) is principal if and only if \(\deg (\mathcal {A}) = 0\) and \(\sum _P{a_P P} = \infty \) [65, theorem 2.25]. Two divisors \(\mathcal {A}\) and \(\mathcal {B}\) are equivalent, \(\mathcal {A} \sim \mathcal {B}\), if their difference \(\mathcal {A} - \mathcal {B}\) is a principal divisor. Let \(P \in E(\mathbb {F}_q)[r]\) where \(r\) is coprime to \(q\), and let \(\mathcal {A}_P\) be a divisor equivalent to \((P) - (\infty )\); under these circumstances the divisor \(r\mathcal {A}_P\) is principal, and hence there is a function \(f_P\) such that \((f_P) = r\mathcal {A}_P = r(P) - r(\infty )\).

Given three groups \(\mathbb {G}_1\), \(\mathbb {G}_2\), and \(\mathbb {G}_T\) of the same prime order \(n\), a pairing is a feasibly computable, non-degenerate bilinear map \(e: \mathbb {G}_1 \times \mathbb {G}_2 \rightarrow \mathbb {G}_T\). The groups \(\mathbb {G}_1\) and \(\mathbb {G}_2\) are commonly (in the so-called Type III pairing setting) determined by the eigenspaces of the Frobenius endomorphism \(\phi _q\) on some elliptic curve \(E/\mathbb {F}_q\) of embedding degree \(k>1\). More precisely, \(\mathbb {G}_1\) is taken to be the 1-eigenspace \(E[n] \cap \ker (\phi _q - [1]) = E(\mathbb {F}_q)[n]\). The group \(\mathbb {G}_2\) is usually taken to be the preimage \(E'(\mathbb {F}_{q^g})[n]\) of the \(q\)-eigenspace \(E[n] \cap \ker (\phi _q - [q]) \subseteq E(\mathbb {F}_{q^k})[n]\) under a twisting isomorphism \(\psi : E' \rightarrow E\), \((x, y) \mapsto (\mu ^2 x, \mu ^3 y)\) for some \(\mu \in \mathbb {F}_{q^k}^*\). In particular, \(g = k/d\) where the curve \(E'/\mathbb {F}_{q^g}\) is the unique twist of \(E\) with largest possible twist degree \(d \mid k\) for which \(n\) divides \(\#E'(\mathbb {F}_{q^g})\) (see [55] for details). This means that \(g\) is as small as possible.

A Miller function \(f_{i,P}\) is a function with divisor \((f_{i,P}) = i(P) - ([i]P) - (i-1)(\infty )\). Miller functions are at the root of most if not all pairings proposed for cryptographic purposes, which in turn induce efficient algorithms derived from Miller’s algorithm [67, 68]. A Miller function satisfies \(f_{a+b,P}(Q) = f_{a,P}(Q) \cdot f_{b,P}(Q) \cdot g_{[a]P,[b]P}(Q) / g_{[a+b]P}(Q)\) up to a constant nonzero factor in \(\mathbb {F}_q\), for all \(a, b \in \mathbb {Z}\), where the so-called line functions \(g_{[a]P,[b]P}\) and \(g_{[a+b]P}\) satisfy \((g_{[a]P,[b]P}) = ([a]P) + ([b]P) + (-[a+b]P) - 3(\infty )\), \((g_{[a+b]P}) = ([a+b]P) + (-[a+b]P) - 2(\infty )\). The advantage of Miller functions with respect to elliptic curve arithmetic is now clear, since with these relations the line functions, and hence the Miller functions themselves, can be efficiently computed as a side result during the computation of \([n]P\) by means of the usual chord-and-tangent method.
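To illustrate how the line functions are accumulated alongside the chord-and-tangent steps, here is a minimal sketch of a double-and-add Miller loop. It rests on several assumptions that are not part of the text: a toy supersingular curve \(y^2 = x^3 + x\) over \(\mathbb {F}_{59}\) with a subgroup of order 5 and embedding degree 2, the extension \(\mathbb {F}_{p^2} = \mathbb {F}_p[i]/(i^2+1)\), the distortion map \((x, y) \mapsto (-x, iy)\), and the omission of the vertical-line denominators, which lie in \(\mathbb {F}_p\) in this setting and vanish under the exponentiation by \((p^2-1)/r\). All function names are hypothetical.

```python
p, r = 59, 5                  # toy choice: E: y^2 = x^3 + x over F_59, #E(F_p) = 60, r | 60
ONE, ZERO = (1, 0), (0, 0)    # F_{p^2} = F_p[i]/(i^2 + 1); (a, b) stands for a + b*i

def f2_sub(x, y):
    return ((x[0] - y[0]) % p, (x[1] - y[1]) % p)

def f2_mul(x, y):
    return ((x[0] * y[0] - x[1] * y[1]) % p, (x[0] * y[1] + x[1] * y[0]) % p)

def f2_inv(x):
    n = pow((x[0] * x[0] + x[1] * x[1]) % p, -1, p)       # inverse of the norm
    return (x[0] * n % p, -x[1] * n % p)

def f2_pow(x, e):
    out = ONE
    while e:
        if e & 1:
            out = f2_mul(out, x)
        x, e = f2_mul(x, x), e >> 1
    return out

def add_line(T, P, Q=None):
    """Return T + P and the line through T and P (tangent if T == P) evaluated at Q."""
    if T is None:
        return P, ONE
    if P is None:
        return T, ONE
    (x1, y1), (x2, y2) = T, P
    if x1 == x2 and y1 == f2_sub(ZERO, y2):               # vertical line x - x1
        return None, (f2_sub(Q[0], x1) if Q else ONE)
    if T == P:
        num = f2_mul((3, 0), f2_mul(x1, x1))
        num = ((num[0] + 1) % p, num[1])                  # 3*x1^2 + a, with a = 1
        lam = f2_mul(num, f2_inv(f2_mul((2, 0), y1)))
    else:
        lam = f2_mul(f2_sub(y2, y1), f2_inv(f2_sub(x2, x1)))
    x3 = f2_sub(f2_sub(f2_mul(lam, lam), x1), x2)
    y3 = f2_sub(f2_mul(lam, f2_sub(x1, x3)), y1)
    val = f2_sub(f2_sub(Q[1], y1), f2_mul(lam, f2_sub(Q[0], x1))) if Q else ONE
    return (x3, y3), val

def mul(k, P):
    R = None
    for bit in bin(k)[2:]:
        R, _ = add_line(R, R)
        if bit == "1":
            R, _ = add_line(R, P)
    return R

def miller(P, Q, m):
    """Evaluate f_{m,P} at Q by double-and-add, without vertical-line denominators."""
    f, T = ONE, P
    for bit in bin(m)[3:]:
        T, line = add_line(T, T, Q)       # doubling step: f <- f^2 * tangent(Q)
        f = f2_mul(f2_mul(f, f), line)
        if bit == "1":
            T, line = add_line(T, P, Q)   # addition step: f <- f * chord(Q)
            f = f2_mul(f, line)
    return f

def psi(P):
    (x, _), (y, _) = P
    return ((-x % p, 0), (0, y))          # distortion map (x, y) -> (-x, i*y)

def pairing(P, Q):
    """Reduced Tate pairing of P with psi(Q), for P, Q in E(F_p)[r]."""
    return f2_pow(miller(P, psi(Q), r), (p * p - 1) // r)

def find_point():
    for x in range(1, p):
        rhs = (x**3 + x) % p
        y = pow(rhs, (p + 1) // 4, p)     # square-root candidate, valid for p = 3 mod 4
        if y and y * y % p == rhs:
            G = mul((p + 1) // r, ((x, 0), (y, 0)))   # clear the cofactor
            if G is not None:
                return G

P = find_point()
e = pairing(P, P)
print(e, f2_pow(e, r), pairing(mul(2, P), mul(3, P)) == f2_pow(e, 6))
```

The final line checks non-degeneracy, the order of the pairing value and one instance of bilinearity; a production implementation would of course use a pairing-friendly ordinary curve and the optimizations discussed later in the paper.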

2.1 Protocols and Assumptions

As an illustration of the enormous flexibility that pairings bring to the construction of cryptographic protocols, we present a (necessarily incomplete) list of known schemes according to their overall category.

Foremost among pairing-based schemes are the identity-based cryptosystems. These include plain encryption [17], digital signatures [24, 83], (authenticated) key agreement [25], chameleon hashing [27], and hierarchical extensions thereof with or without random oracles [22, 51].

Other pairing-based schemes are not identity-based but feature special functionalities like secret handshakes [5], short/aggregate/verifiably encrypted/group/ring/blind signatures [19, 20, 26, 97, 98] and signcryption [9, 21, 61].

Together with the abundance of protocols came a matching abundance of security assumptions, often tailored to the nature of each particular protocol although some assumptions found a more general use and became classical. Some of the most popular and useful security assumptions occurring in security proofs of pairing-based protocols are the following, with groups \(\mathbb {G}_1\) and \(\mathbb {G}_2\) of order \(n\) in multiplicative notation (and \(\mathbb {G}\) denotes either group):

  • \(\mathsf {q}\)-Strong Diffie-Hellman (\(\mathsf {q}\)-SDH) [16] and many related assumptions (like the Inverse Computational Diffie-Hellman (Inv-CDH), the Square Computational Diffie-Hellman (Squ-CDH), the Bilinear Inverse Diffie-Hellman (BIDH), and the Bilinear Square Diffie-Hellman (BSDH) assumptions [98]): Given a \((\mathsf {q}+2)\)-tuple \((g_1,g_2, g_2^x, \dots , g_2^{x^\mathsf {q}}) \in \mathbb {G}_1 \times \mathbb {G}_2^{\mathsf {q}+1}\) as input, compute a pair \((c, g_1^{1/(x+c)}) \in \mathbb {Z}/n\mathbb {Z}\times \mathbb {G}_1\).

  • Decision Bilinear Diffie-Hellman (DBDH) [18] and related assumptions (like the \(k\)-BDH assumption [14]): Given generators \(g_1\) and \(g_2\) of \(\mathbb {G}_1\) and \(\mathbb {G}_2\) respectively, and given \(g_1^a\), \(g_1^b\), \(g_1^c\), \(g_2^a\), \(g_2^b\), \(g_2^c\), \(e(g_1,g_2)^z\) determine whether \(e(g_1,g_2)^{abc} = e(g_1,g_2)^z\).

  • Gap Diffie-Hellman (GDH) assumption [77]: Given \((g, g^a, g^b) \in \mathbb {G}^3\) for a group \(\mathbb {G}\) equipped with an oracle for deciding whether \(g^{ab} = g^c\) for any given \(g^c \in \mathbb {G}\), find \(g^{ab}\).

  • \((k + 1)\) Exponent Function meta-assumption: Given a function \(f: \mathbb {Z}/n\mathbb {Z}\rightarrow \mathbb {Z}/n\mathbb {Z}\) and a sequence \((g, g^a, g^{f(h_1+a)}, \dots , g^{f(h_k+a)}) \in \mathbb {G}_1^{k+2}\) for some \(a\), \(h_1, \dots , h_k \in \mathbb {Z}/n\mathbb {Z}\), compute \(g^{f(h+a)}\) for some \(h \notin \{h_1, \dots , h_k\}\).

The last of these is actually a meta-assumption, since it is parameterized by a function \(f\) on the exponents. This meta-assumption includes the Collusion attack with \(k\) traitors (\(k\)-CAA) assumption [70], where \(f(x) := 1/x\), and the \((k + 1)\) Square Roots (\((k+1)\)-SR) assumption [96], where \(f(x) := \sqrt{x}\), among others. Of course, not all choices of \(f\) may lead to a consistent security assumption (for instance, the constant function is certainly a bad choice), so the instantiation of this meta-assumption must be done on a case-by-case basis.

Also, not all of these assumptions are entirely satisfactory from the point of view of their relation to the computational complexity of the more fundamental discrete logarithm problem. In particular, the Cheon attack [28, 29] showed that, contrary to most discrete-logarithm style assumptions, which usually claim a practical security level of \(2^\lambda \) for \(2\lambda \)-bit keys due to e.g. the Pollard-\(\rho \) attack [81], the \(\mathsf {q}\)-SDH assumption may need \(3\lambda \)-bit keys to attain that security level, according to the choice of \(\mathsf {q}\).

3 Curves and Algorithms

3.1 Supersingular Curves

Early proposals to obtain efficient pairings invoked the adoption of supersingular curves [40, 49, 82], which led to the highly efficient concept of \(\eta \) pairings [7] over fields of small characteristic. This setting enables the so-called Type I pairings, which are defined with both arguments from the same group [50], and facilitates the description of many protocols and the construction of formal security proofs. Unfortunately, recent developments bring that approach into question, since discrete logarithms in the multiplicative groups of the associated extension fields have proven far easier to compute than anticipated [6].

Certain ordinary curves, on the other hand, are not known to be susceptible to that line of attack, and also yield very efficient algorithms, as we will see next.

3.2 Generic Constructions

Generic construction methods enable choosing the embedding degree at will, limited only by efficiency requirements. Two such constructions are known:

  • The Cocks-Pinch construction [32] enables the construction of elliptic curves over \(\mathbb {F}_q\) containing a pairing-friendly group of order \(n\) with \(\lg (q)/\lg (n) \approx 2\).

  • The Dupont-Enge-Morain strategy [39] is similarly generic in the sense of its embedding degree flexibility by maximizing the trace of the Frobenius endomorphism. Like the Cocks-Pinch method, it only attains \(\lg (q)/\lg (n) \approx 2\).

Because the smallest attainable ratio \(\lg (q)/\lg (n)\) is relatively large, these methods do not yield curves of prime order. Such curves are necessary for certain applications like short signatures, and they also tend to improve the overall processing efficiency.

3.3 Sparse Families of Curves

Certain families of curves may be obtained by parameterizing the norm equation \(4q - t^2 = 4hn - (t - 2)^2 = DV^2\) with polynomials \(q(u)\), \(t(u)\), \(h(u)\), \(n(u)\), then choosing \(t(u)\) and \(h(u)\) according to some criteria (for instance, setting \(h(u)\) to be some small constant polynomial yields near-prime order curves), and directly finding integer solutions (in \(u\) and \(V\)) to the result. In practice this involves a clever mapping of the norm equation into a Pell-like equation, whose solutions lead to actual curve equations via complex multiplication (CM).
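The equality of the two left-hand forms of the norm equation can be checked in one line, assuming (as the notation suggests) that \(h\) denotes the cofactor, so that \(hn = q + 1 - t\):

$$\begin{aligned} 4q - t^2&= 4(hn + t - 1) - t^2 \\&= 4hn - (t^2 - 4t + 4) \\&= 4hn - (t - 2)^2. \end{aligned}$$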

The only drawback they present is the relative rarity of suitable curves (the only embedding degrees that are known to yield solutions are \(k \in \{3, 4, 6, 10\}\), and the size of the integer solutions \(u\) grows exponentially), especially those with prime order. Historically, sparse families are divided into Miyaji-Nakabayashi-Takano (MNT) curves and Freeman curves.

MNT curves were the first publicly known construction of ordinary pairing-friendly curves [71]. Given their limited range of admissible embedding degrees (namely, \(k \in \{3, 4, 6\}\)), the apparent finiteness of MNT curves of prime order [58, 63, 92], and efficiency considerations (see e.g. [44]), MNT curves are less useful for higher security levels (say, from about \(2^{112}\) onward).

Freeman curves [43], with embedding degree \(k = 10\), are far rarer and suffer more acutely from the fact that the nonexistence of a twist of degree higher than quadratic forces its \(\mathbb {G}_2\) group to be defined over \(\mathbb {F}_{q^5}\). Besides, this quintic extension cannot be constructed using a binomial representation.

3.4 Complete Families of Curves

Instead of trying to solve the partially parameterized norm equation \(4h(u)n(u) - (t(u) - 2)^2 = DV^2\) for \(u\) and \(V\) directly as for the sparse families of curves, one can also parameterize \(V = V(u)\) as well. Solutions may exist if the parameters can be further constrained, which is usually done by considering the properties of the number field \(\mathbb {Q}[u]/n(u)\), specifically by requiring that it contains a \(k\)-th root of unity where \(k\) is the desired embedding degree. Choosing \(n(u)\) to be a cyclotomic polynomial \(\varPhi _\ell (u)\) with \(k \mid \ell \) yields the suitably named cyclotomic family of curves [10, 11, 23, 44], which enable a reasonably small ratio \(\rho := \lg (q)/\lg (n)\) (e.g. \(\rho = (k+1)/(k-1)\) for prime \(k \equiv 3 \pmod {4}\)).

Yet, there is one other family of curves that attains \(\rho \approx 1\), namely, the Barreto-Naehrig (BN) curves [12]. BN curves arguably constitute one of the most versatile classes of pairing-friendly elliptic curves. A BN curve is an elliptic curve \(E_u: y^2 = x^3 + b\) defined over a finite prime field \(\mathbb {F}_p\) of (typically prime) order \(n\), where \(p\) and \(n\) are given by \(p = p(u) = 36u^4 + 36u^3 + 24u^2 + 6u + 1\) and \(n = n(u) = 36u^4 + 36u^3 + 18u^2 + 6u + 1\) (hence \(t = t(u) = 6u^2 + 1\)) for \(u \in \mathbb {Z}\). One can check by straightforward inspection that \(\varPhi _{12}(t(u) - 1) = n(u) n(-u)\), hence \(\varPhi _{12}(p(u)) \equiv \varPhi _{12}(t(u) - 1) \equiv 0 \pmod {n(u)}\), so the group of order \(n(u)\) has embedding degree \(k = 12\).
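The "straightforward inspection" can also be delegated to a few lines of integer arithmetic. The sketch below assumes the parameter \(u = -(2^{62} + 2^{55} + 1)\), which is commonly reported in the literature for a 254-bit BN curve; this value is not stated in the text above and is used here for illustration only. The function names are hypothetical, and the base-2 Fermat tests are only a quick plausibility check, not a primality proof.

```python
def bn(u):
    """Return (p, n, t) for the BN parameterization."""
    p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
    n = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1
    t = 6 * u**2 + 1
    return p, n, t

def phi12(x):
    """12th cyclotomic polynomial."""
    return x**4 - x**2 + 1

u = -(2**62 + 2**55 + 1)        # assumed illustrative parameter, not given in the text
p, n, t = bn(u)
n_neg = bn(-u)[1]               # n(-u)

order_ok = (n == p + 1 - t)                     # n = p + 1 - t
factor_ok = (phi12(t - 1) == n * n_neg)         # Phi_12(t - 1) = n(u) n(-u)
divides_ok = (phi12(p) % n == 0)                # n | Phi_12(p)
k = next(s for s in range(1, 13) if pow(p, s, n) == 1)   # embedding degree

# Quick base-2 Fermat checks (probabilistic evidence only).
p_fermat = pow(2, p - 1, p) == 1
n_fermat = pow(2, n - 1, n) == 1
print(p.bit_length(), order_ok, factor_ok, divides_ok, k, p_fermat, n_fermat)
```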

BN curves also have \(j\)-invariant 0, so there is no need to resort explicitly to the CM curve construction method: all one has to do is choose an integer \(u\) of suitable size such that \(p\) and \(n\) as given by the above polynomials are prime. To find a corresponding curve, one chooses \(b \in \mathbb {F}_p\) among the six possible classes so that the curve \(E: y^2 = x^3 + b\) has order \(n\).

Furthermore, BN curves admit a sextic twist (\(d=6\)), so that one can set \(\mathbb {G}_2 = E'(\mathbb {F}_{p^2})[n]\). This twist \(E'/\mathbb {F}_{p^2}\) may be selected by finding a non-square and non-cube \(\xi \in \mathbb {F}_{p^2}\) and then checking via scalar multiplication whether the curve \(E': y^2 = x^3 + b'\) given by \(b' = b/\xi \) or by \(b' = b/\xi ^5\) has order divisible by \(n\). However, construction methods are known that dispense with such procedure, yielding the correct curve and its twist directly [80]. For convenience, following [85] we call the twist \(E': y^2 = x^3 + b/\xi \) a \(D\)-type twist, and we call the twist \(E': y^2 = x^3 + b\xi \) an \(M\)-type twist.

3.5 Holistic Families

Early works targeting specifically curves that have some efficiency advantage have focused on only one or a few implementation aspects, notably the pairing computation itself [13, 15, 38, 90].

More modern approaches tend to consider most if not all efficiency aspects that arise in pairing-based schemes [34, 36, 80]. This means that curves of those families tend to support not only fast pairing computation, but efficient finite field arithmetic for all fields involved, curve construction, generator construction for both \(\mathbb {G}_1\) and \(\mathbb {G}_2\), multiplication by a scalar in both \(\mathbb {G}_1\) and \(\mathbb {G}_2\), point sampling, hashing to the curve [42], and potentially other operations as well.

Curiously enough, there is not a great deal of diversity among the most promising such families, which comprise essentially only BN curves, BLS curves [10], and KSS curves [57].

3.6 Efficient Algorithms

Ordinary curves with small embedding degree also come equipped with efficient pairing algorithms, which tend to be variants of the Tate pairing [8, 48, 55, 60, 76] (although some fall back to the Weil pairing while remaining fairly efficient [94]). In particular, one now knows concrete practical limits to how efficient a pairing can be, in the form of the so-called optimal pairings [93].

As we pointed out, Miller functions are essential to the definition of most cryptographic pairings. Although all pairings can be defined individually in formal terms, it is perhaps more instructive to give the following constructive definitions, assuming an underlying curve \(E/\mathbb {F}_q\) containing a group \(E(\mathbb {F}_q)[n]\) of prime order \(n\) with embedding degree \(k\) and letting \(z := (q^k-1)/n\):

  • Weil pairing: \(w(P,Q) := (-1)^n f_{n,P}(Q)/f_{n,Q}(P)\).

  • Tate pairing: \(\tau (P,Q) := f_{n,P}(Q)^z\).

  • Eta pairing [7] (called the twisted Ate pairing when defined over an ordinary curve): \(\eta (P,Q) := f_{\lambda ,P}(Q)^z\) where \(\lambda ^d \equiv 1 \pmod {n}\).

  • Ate pairing [55]: \(a(P,Q) := f_{t - 1,Q}(P)^z\), where \(t\) is the trace of the Frobenius.

  • Optimized Ate and twisted Ate pairings [64]: \(a_c(P,Q) := f_{(t - 1)^c \mod n,Q}(P)^z\), \(\eta _c(P,Q) := f_{\lambda ^c \mod n,P}(Q)^z\), for some \(0 < c < k\).

  • Optimal Ate pairing [93]: \(a_{\mathrm {opt}}(P,Q) := f_{\ell ,Q}(P)^z\) for a certain \(\ell \) such that \(\lg \ell \approx (\lg n)/\varphi (k)\).

Optimal pairings achieve the shortest loop length among all of these pairings. In particular, for a BN curve with parameter \(u\) there exists an optimal Ate pairing with loop length \(\ell = |6u + 2|\). To obtain unique values, most of these pairings (the Weil pairing is an exception) are reduced via the final exponentiation by \(z\). The efficient computation of this exponentiation is itself a subject of research [89].
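As a rough illustration of how the loop length shrinks from the Tate pairing to the Ate and optimal Ate pairings, the following sketch compares the bit lengths and Hamming weights of the three loop parameters on a BN curve. The parameter \(u = -(2^{62} + 2^{55} + 1)\) is an assumption borrowed from the literature on 254-bit BN curves and is not given in the text.

```python
u = -(2**62 + 2**55 + 1)        # assumed illustrative BN parameter
n = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1
t = 6 * u**2 + 1

loops = {
    "Tate": n,                       # f_{n,P}
    "Ate": t - 1,                    # f_{t-1,Q}
    "optimal Ate": abs(6 * u + 2),   # f_{l,Q} with l = |6u + 2|
}
for name, m in loops.items():
    # The bit length approximates the number of doubling steps;
    # the Hamming weight minus one counts the addition steps.
    print(f"{name:12s} bits={m.bit_length():3d} weight={bin(m).count('1')}")
```

With \(\varphi (12) = 4\), the optimal Ate loop is expected to be roughly a quarter of the length of the Tate loop, which is what the bit lengths show.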

A clear trend in recent works has been to attain exceptional performance gains by limiting the allowed curves to a certain subset, sometimes to a single curve at a useful security level [4, 15, 75, 80]. In the next section, we discuss aspects pertaining to such implementations.

4 Implementation Aspects

The optimal Ate pairing on BN curves has been the focus of intense implementation research in the last few years. Most remarkably, beginning in 2008, a series of works improved, each one on top of the preceding one, the practical performance on Intel 64-bit platforms [15, 54, 75]. This effort reached its pinnacle in 2011, when Aranha et al. [4] reported an implementation running in about half a millisecond (see also [62]). Since then, performance of efficient software implementations has mostly stabilized, but some aspects of pairing computation continuously improved through the availability of new techniques [47], processor architecture revisions and instruction set refinements [79]. In this section, we revisit the problem of efficient pairing computation working on top of the implementation presented in [4], to explore these latest advances and provide new performance figures. Our updated implementation achieves high performance on a variety of modern 64-bit computing platforms, including both relatively old processors and the latest microarchitectures.

4.1 Pairing Algorithm

The BN family of curves is ideal from an implementation point of view. Having embedding degree \(k=12\), it is perfectly suited to the 128-bit security level and a competitive candidate at the 192-bit security level for protocols involving a small number of pairing computations [2]. Additionally, the size of the family facilitates generation [80] and supports many different parameter choices, allowing for customization of software implementations to radically different computing architectures [4, 52, 53]. The optimal Ate pairing construction applied to general BN curves further provides a rather simple formulation among the potential candidates [60, 76]:

$$\begin{aligned} a_{\mathrm {opt}}: \mathbb {G}_2\times \mathbb {G}_1&\rightarrow \mathbb {G}_T\\ (Q,P)&\mapsto (f_{\ell ,Q}(P) \cdot g_{[\ell ]Q,\phi _p(Q)}(P) \cdot g_{[\ell ]Q + \phi _p(Q),-\phi _p^2(Q)}(P))^{\frac{p^{12} - 1}{n}}, \end{aligned}$$

with \(\ell = 6u + 2\), map \(\phi _p\) and groups \(\mathbb {G}_1,\mathbb {G}_2,\mathbb {G}_T\) as previously defined; and an especially efficient modification of Miller’s Algorithm for accumulating all the required line evaluations in the Miller variable \(f\) (Algorithm 1).

The extension field arithmetic involving \(f\) is in fact the main building block of the pairing computation, including Miller’s algorithm and final exponentiation. Hence, its efficient implementation is crucial. To that end, it has been recommended to implement the extension field through a tower of extensions built with appropriate choices of irreducible polynomials [15, 38, 54, 80]:

$$\begin{aligned} \mathbb {F}_{p^{2}}&= \mathbb {F}_p[i]/(i^2 - \beta ), \mathrm{with }\; \beta \;{\text {a non-square}},\end{aligned} \tag{1}$$
$$\begin{aligned} \mathbb {F}_{p^{4}}&= \mathbb {F}_{p^{2}}[s]/(s^2 - \xi ), \mathrm{with }\; \xi \; {\text {a non-square}},\end{aligned} \tag{2}$$
$$\begin{aligned} \mathbb {F}_{p^{6}}&= \mathbb {F}_{p^{2}}[v]/(v^3 - \xi ), \mathrm{with }\; \xi \; {\text {a non-cube}},\end{aligned} \tag{3}$$
$$\begin{aligned} \mathbb {F}_{p^{12}}&= \mathbb {F}_{p^{4}}[t]/(t^3 - s)\end{aligned} \tag{4}$$
$$\begin{aligned}&\mathrm{or}\;\; \mathbb {F}_{p^{6}}[w]/(w^2 - v)\end{aligned} \tag{5}$$
$$\begin{aligned}&\mathrm{or}\;\; \mathbb {F}_{p^{2}}[w]/(w^6 - \xi ), \mathrm{with }\; \xi \; {\text {a non-square and non-cube}}. \end{aligned} \tag{6}$$

Note that \(\xi \) is the same non-residue used to define the twist equations in Sect. 3.4 and that converting from one towering scheme to another is possible by simply reordering coefficients. By allowing intermediate values to grow to double precision and choosing \(p\) to be a prime number slightly smaller than a multiple of the processor word, lazy reduction can be efficiently employed in all levels of the towering arithmetic [4]. A remarkably efficient set of parameters arising from the curve choice \(E(\mathbb {F}_p) : y^2 = x^3 + 2\), with \(p \equiv 3 \pmod {4}\), is \(\beta = -1\), \(\xi = (1+i)\) [80], simultaneously optimizing finite field and curve arithmetic.
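The conditions on \(\beta \) and \(\xi \) can be tested with Euler-style criteria. The sketch below assumes the BN parameter \(u = -(2^{62} + 2^{55} + 1)\) (an illustrative value from the literature, not stated in the text) and checks that \(\beta = -1\) is a non-square in \(\mathbb {F}_p\) and that \(\xi = 1 + i\) is neither a square nor a cube in \(\mathbb {F}_{p^2}\). It also shows why this \(\xi \) is convenient: multiplying by it needs only one addition and one subtraction in \(\mathbb {F}_p\). Helper names are hypothetical.

```python
u = -(2**62 + 2**55 + 1)        # assumed illustrative BN parameter
p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
ONE, XI = (1, 0), (1, 1)        # elements a + b*i of F_p[i]/(i^2 + 1), stored as (a, b)

def f2_mul(x, y):
    return ((x[0] * y[0] - x[1] * y[1]) % p, (x[0] * y[1] + x[1] * y[0]) % p)

def f2_pow(x, e):
    out = ONE
    while e:
        if e & 1:
            out = f2_mul(out, x)
        x, e = f2_mul(x, x), e >> 1
    return out

def mul_by_xi(x):
    """(a + b*i)(1 + i) = (a - b) + (a + b)*i: no F_p multiplication needed."""
    return ((x[0] - x[1]) % p, (x[0] + x[1]) % p)

beta_nonsquare = pow(p - 1, (p - 1) // 2, p) == p - 1     # Euler criterion for beta = -1
xi_nonsquare = f2_pow(XI, (p * p - 1) // 2) != ONE        # xi is not a square in F_{p^2}
xi_noncube = f2_pow(XI, (p * p - 1) // 3) != ONE          # xi is not a cube in F_{p^2}
print(beta_nonsquare, xi_nonsquare, xi_noncube)
```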

Algorithm 1 (figure not reproduced)

4.2 Field Arithmetic

Prime fields involved in pairing computation in the asymmetric setting are commonly represented with dense moduli, resulting from the parameterized curve constructions. While the particular structure of the prime modulus has been successfully exploited for performance optimization in both software [75] and hardware [41], current software implementations rely on the standard Montgomery reduction [72] and state-of-the-art hardware implementations on the parallelization capabilities of the Residue Number System [30].

Arithmetic in the base field is usually implemented in carefully scheduled Assembly code, but the small number of words required to represent a 256-bit prime field element in a 64-bit processor encourages the use of Assembly directly in the quadratic extension field, to avoid penalties related to frequent function calls [15]. Multiplication and reduction in \(\mathbb {F}_p\) are implemented through a Comba strategy [33], but a Schoolbook approach is favored in recent Intel processors, due to the availability of the carry-preserving multiplication instruction mulx, allowing delayed handling of carries [79]. Future processors will allow similar speedups on the Comba-based multiplication and Montgomery reduction routines by carry-preserving addition instructions [78].
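For readers unfamiliar with Montgomery reduction, the following is a minimal word-agnostic sketch using Python integers in place of the four 64-bit limbs an Assembly implementation would use. It assumes \(R = 2^{256}\) and the illustrative BN parameter \(u = -(2^{62} + 2^{55} + 1)\), neither of which is specified in the text; function names are hypothetical. The last lines also show the idea behind lazy reduction: two double-precision products are added before a single reduction.

```python
u = -(2**62 + 2**55 + 1)        # assumed illustrative BN parameter
p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
BITS = 256                      # four 64-bit words (assumption)
R = 1 << BITS
P_NEG_INV = -pow(p, -1, R) % R  # -p^{-1} mod R, precomputed once

def redc(T):
    """Montgomery reduction: returns T * R^{-1} mod p for 0 <= T < p*R."""
    m = (T % R) * P_NEG_INV % R
    t = (T + m * p) >> BITS     # exact division by R
    return t - p if t >= p else t

def to_mont(a):
    return a * R % p

def from_mont(a):
    return redc(a)

def mont_mul(a, b):
    return redc(a * b)

a1, b1, a2, b2 = (pow(g, 200, p) for g in (3, 5, 7, 11))   # arbitrary test values
prod = from_mont(mont_mul(to_mont(a1), to_mont(b1)))

# Lazy reduction: accumulate two unreduced products, then reduce once.
# This is valid here because 2*p^2 < p*R when p has at most 255 bits.
lazy = from_mont(redc(to_mont(a1) * to_mont(b1) + to_mont(a2) * to_mont(b2)))
print(prod == a1 * b1 % p, lazy == (a1 * b1 + a2 * b2) % p)
```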

Divide-and-conquer approaches are used only for multiplication in \(\mathbb {F}_{p^{2}}\), \(\mathbb {F}_{p^{6}}\) and \(\mathbb {F}_{p^{12}}\), because Karatsuba is typically more efficient over extension fields, since additions are relatively inexpensive in comparison with multiplication. The full details of the formulas that we use in our implementation of extension field arithmetic can be found in [4], including the opportunities for reducing the number of Montgomery reductions via lazy reduction. The case of squaring is relatively more complex. We use the complex squaring in \(\mathbb {F}_{p^{2}}\) and, for \(\mathbb {F}_{p^{6}}\) and \(\mathbb {F}_{p^{12}}\), we employ the faster Chung-Hasan asymmetric SQR3 formula [31]. The sparseness of the line functions motivates the implementation of specialized multiplication routines for accumulating the line function into the Miller variable \(f\) (sparse multiplication) or for multiplying line functions together (sparser multiplication). For sparse multiplication over \(\mathbb {F}_{p^6}\) and \(\mathbb {F}_{p^{12}}\), we use the formulas proposed by Grewal et al. (see Algorithms 5 and 6 in [53]). Faster formulas for sparser multiplication can be trivially obtained by adapting the sparse multiplication formula to remove operations involving the missing subfield elements.
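The trade of multiplications for additions can be seen already in the quadratic extension. The sketch below is an illustration rather than the formulas of [4]: it assumes \(\mathbb {F}_{p^2} = \mathbb {F}_p[i]/(i^2 + 1)\) and the illustrative BN parameter \(u = -(2^{62} + 2^{55} + 1)\), and compares schoolbook multiplication (four \(\mathbb {F}_p\) products) with Karatsuba multiplication (three products) and complex squaring (two products). Each output coefficient is reduced only once, in the spirit of lazy reduction.

```python
import random

u = -(2**62 + 2**55 + 1)        # assumed illustrative BN parameter
p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1

def mul_schoolbook(x, y):
    (a0, a1), (b0, b1) = x, y
    return ((a0 * b0 - a1 * b1) % p, (a0 * b1 + a1 * b0) % p)   # 4 products

def mul_karatsuba(x, y):
    (a0, a1), (b0, b1) = x, y
    v0, v1 = a0 * b0, a1 * b1
    cross = (a0 + a1) * (b0 + b1) - v0 - v1                     # a0*b1 + a1*b0
    return ((v0 - v1) % p, cross % p)                           # 3 products

def sqr_complex(x):
    a0, a1 = x
    # (a0 + a1*i)^2 = (a0 + a1)(a0 - a1) + 2*a0*a1*i when i^2 = -1
    return ((a0 + a1) * (a0 - a1) % p, 2 * a0 * a1 % p)         # 2 products

rng = random.Random(1)          # fixed seed so the run is reproducible
x = (rng.randrange(p), rng.randrange(p))
y = (rng.randrange(p), rng.randrange(p))
print(mul_karatsuba(x, y) == mul_schoolbook(x, y),
      sqr_complex(x) == mul_schoolbook(x, x))
```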

In the following, we closely follow notation for operation costs from [4]. Let \(m,s,a,i\) denote the cost of multiplication, squaring, addition and inversion in \(\mathbb {F}_p\), respectively; \(\tilde{m}, \tilde{s}, \tilde{a}, \tilde{\imath }\) denote the cost of multiplication, squaring, addition and inversion in \(\mathbb {F}_{p^{2}}\), respectively; \(m_u,s_u,r\) denote the cost of unreduced multiplication and squaring producing double-precision results, and modular reduction of double-precision integers, respectively; \(\tilde{m}_u,\tilde{s}_u,\tilde{r}\) denote the cost of unreduced multiplication and squaring, and modular reduction of double-precision elements in \(\mathbb {F}_{p^{2}}\), respectively. To simplify the operation count, we consider the cost of field subtraction, negation and division by two equivalent to that of field addition. Also, one double-precision addition is considered equivalent to the cost of two single-precision additions.

4.3 Curve Arithmetic

Pairings can be computed over elliptic curves represented in any coordinate system, but popular choices have been homogeneous projective and affine coordinates, depending on the ratio between inversion and multiplication. Jacobian coordinates were initially explored in a few implementations [15, 75], but ended up superseded by homogeneous coordinates because of the latter's superior efficiency [35]. Point doublings and their corresponding line evaluations usually dominate the cost of the Miller loop, since efficient parameters tend to minimize the Hamming weight of the loop parameter \(\ell \) and the resulting number of point additions. Below, we review and slightly refine the best formulas available for the curve arithmetic involved in pairing computation in affine and homogeneous projective coordinates.

Affine Coordinates. The choice of affine coordinates has proven more useful at higher security levels and embedding degrees, due to the action of the norm map on simplifying the computation of inverses at higher extensions [59, 86]. The main advantages of affine coordinates are the simplicity of implementation and format of the line functions, allowing faster accumulation inside the Miller loop if the additional sparsity is exploited. If \(T = (x_1,y_1)\) is a point in \(E'(\mathbb {F}_{p^{2}})\), one can compute the point \(2T := T + T\) with the following formula [53]:

$$\begin{aligned} \begin{array}{c} \lambda = \displaystyle \frac{3x_1^2}{2y_1},\,\,\, x_3 = \lambda ^2 - 2x_1,\,\,\, y_3 = (\lambda x_1 - y_1) - \lambda x_3. \end{array} \end{aligned}$$
(7)

When \(E'\) is a \(D\)-type twist given by the twisting isomorphism \(\psi \), the tangent line evaluated at \(P = (x_P,y_P)\) has the format \(g_{2\psi (T)}(P) = y_P - \lambda x_Pw + (\lambda x_1 - y_1)w^3\) according to the tower representation given by Eq. (6). This function can be evaluated at a cost of \(3\tilde{m} + 2\tilde{s} + 7\tilde{a} + \tilde{\imath } + 2m\) with the precomputation cost of \(1a\) to compute \(\overline{x}_P = -x_P\) [53]. By performing more precomputation as \(y'_P = 1/y_P\) and \(x'_P = \overline{x}_P/y_P\), we can simplify the tangent line further:

$$\begin{aligned} y'_P \cdot g_{2\psi (T)}(P) = 1 + \lambda x'_Pw + y'_P(\lambda x_1 - y_1)w^3. \end{aligned}$$

Since the final exponentiation eliminates any subfield element multiplying the pairing value, this modification does not change the pairing result. Computing the simpler line function now requires \(3\tilde{m} + 2\tilde{s} + 7\tilde{a} + \tilde{\imath } + 4m\) with an additional precomputation cost of \((i + m)\):

$$\begin{aligned} \begin{array}{c} A = \dfrac{1}{2y_1},\,\,\,B = 3x_1^2,\,\,\,C = AB,\,\,\,D = 2x_1,\,\,\,x_3 = C^2 - D,\\ \,\,\,E = Cx_1 - y_1,\,\,\,y_3 = E - Cx_3,\,\,\,F = Cx'_P,\,\,\,G = Ey'_P,\\ y'_P \cdot g_{2\psi (T)}(P) = 1 + Fw + Gw^3. \end{array} \end{aligned}$$

This clearly does not save any operations compared to Eq. (7) and increases the cost by \(2m\). However, the simpler format allows the faster accumulation \(f^2 \cdot g_{2\psi (T)}(P) = (f_0 + f_1w)(1 + g_1w)\), where \(f_0,f_1,g_1 \in \mathbb {F}_{p^{6}}\), by saving \(6m\) corresponding to the multiplication between \(y_P\) and each subfield element of \(f_0\). The performance trade-off compared to [53] is thus \(4m\) per Miller doubling step.
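The operation sequence above can be sanity-checked on a toy example. The following is only an illustrative sketch: it assumes a small curve \(y^2 = x^3 + 3\) over \(\mathbb {F}_{101}\) (hypothetical parameters), takes all coordinates in the prime field rather than in \(\mathbb {F}_{p^{2}}\), and uses helper names of our own choosing. It checks that the sequence \(A, \ldots , G\) reproduces Eq. (7) and that the coefficients \(F\) and \(G\) equal \(y'_P\) times the coefficients of the original tangent line.

```python
# Toy check of the affine doubling/line sequence A..G.
# Assumptions (not from the text): curve y^2 = x^3 + 3 over F_101, and all
# coordinates taken in the prime field instead of F_{p^2}.
p, b = 101, 3

def inv(x):
    return pow(x % p, -1, p)  # modular inverse (Python 3.8+)

def on_curve(pt):
    x, y = pt
    return (y * y - x ** 3 - b) % p == 0

def double_with_line(T, xP_n, yP_n):
    """Return 2T and the coefficients (1, F, G) of 1 + F*w + G*w^3."""
    x1, y1 = T
    A = inv(2 * y1)
    B = 3 * x1 * x1 % p
    C = A * B % p              # C is the slope lambda
    D = 2 * x1 % p
    x3 = (C * C - D) % p
    E = (C * x1 - y1) % p
    y3 = (E - C * x3) % p
    F = C * xP_n % p
    G = E * yP_n % p
    return (x3, y3), (1, F, G)

# Enumerate affine points with y != 0 and pick two of them.
points = [(x, y) for x in range(p) for y in range(1, p) if on_curve((x, y))]
T, P = points[0], points[1]
xP, yP = P

# Precomputation: y'_P = 1/y_P and x'_P = -x_P/y_P.
yP_n = inv(yP)
xP_n = (-xP) * yP_n % p

T2, line = double_with_line(T, xP_n, yP_n)

# Reference values straight from Eq. (7) and the unnormalized tangent line
# y_P - lambda*x_P*w + (lambda*x_1 - y_1)*w^3, scaled by y'_P.
lam = 3 * T[0] ** 2 * inv(2 * T[1]) % p
ref_x3 = (lam * lam - 2 * T[0]) % p
ref_y3 = (lam * (T[0] - ref_x3) - T[1]) % p
ref_line = tuple(yP_n * c % p for c in (yP, -lam * xP, lam * T[0] - T[1]))
```

In this sketch the constant coefficient of the normalized line is always 1, which is the property that lets the accumulation skip the multiplications by \(y_P\).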

When different points \(T = (x_1,y_1)\) and \(Q = (x_2,y_2)\) are considered, the point \(T + Q\) can be computed with the following formula:

$$\begin{aligned} \begin{array}{c} \lambda = \dfrac{y_2 - y_1}{x_2 - x_1},\,\,\, x_3 = \lambda ^2 - x_2 - x_1,\,\,\, y_3 = \lambda (x_1 - x_3) - y_1. \end{array} \end{aligned}$$
(8)

Applying the same trick described above gives the same performance trade-off, with a cost of \(3\tilde{m} + \tilde{s} + 6\tilde{a} + \tilde{\imath } + 4m\) [53]:

$$\begin{aligned} \begin{array}{c} A = \dfrac{1}{x_2 - x_1},\,\,\,B = y_2 - y_1,\,\,\,C = AB,\,\,\,D = x_1 + x_2,\,\,\,x_3 = C^2 - D,\\ \,\,\,E = Cx_1 - y_1,\,\,\,y_3 = E - Cx_3,\,\,\,F = Cx'_P,\,\,\,G = Ey'_P,\\ y'_P \cdot g_{\psi (T),\psi (Q)}(P) = 1 + Fw + Gw^3. \end{array} \end{aligned}$$

The technique can be further employed in \(M\)-type twists, preserving their equivalent performance to \(D\)-type twists [53], with some slight changes in the formula format and accumulation multiplier. A generalization for other pairing-friendly curves with degree-\(d\) twists and even embedding degree \(k\) would provide a performance trade-off of \((k/2 - k/d)\) multiplications per step in Miller’s Algorithm; for BN curves, with \(k = 12\) and \(d = 6\), this gives the \(4m\) obtained above. The same idea was independently proposed and slightly improved in [73].

Homogeneous Projective Coordinates. The choice of projective coordinates has proven especially advantageous at the 128-bit security level for single pairing computation, due to the typically large inversion/multiplication ratio in this setting. If \(T = (X_1,Y_1,Z_1) \in E'(\mathbb {F}_{p^{2}})\) is a point in homogeneous coordinates, one can compute the point \(2T = (X_3,Y_3,Z_3)\) with the following formula [4]:

$$\begin{aligned} \begin{array}{c} X_3 = \displaystyle \frac{X_1Y_1}{2}(Y_1^2-9b'Z_1^2), \\ Y_3 = \left[ \displaystyle \frac{1}{2}(Y_1^2+9b'Z_1^2)\right] ^2-27b'^2Z_1^4,\,\,\, Z_3 = 2Y_1^3Z_1. \end{array} \end{aligned}$$
(9)

The twisting point \(P\) can be represented by \((x_Pw, y_P)\). When \(E'\) is a \(D\)-type twist given by the twisting isomorphism \(\psi \), the tangent line evaluated at \(P = (x_P,y_P)\) can be computed with the following formula [53]:

$$\begin{aligned} g_{2\psi (T)}(P) = -2Y_1Z_1y_P + 3X_1^2x_Pw + (3b'Z_1^2-Y_1^2)w^3 \end{aligned}$$
(10)

Equation (10) is basically the same line evaluation formula presented in [35] plus an efficient selection of the positioning of terms (obtained by multiplying the line evaluation by \(w^3\)), which was suggested in [53] to obtain a fast sparse multiplication in the Miller loop (in particular, the use of terms \(1, w\) and \(w^3\) [53] induces a sparse multiplication that saves \(13\tilde{a}\) in comparison to the use of terms \(1, v^2\) and \(wv\) in [4]). The full doubling/line function formulae in [35] cost \(2\tilde{m} + 7\tilde{s} + 23\tilde{a} + 4m + m_{b'}\). Based on Eqs. (9) and (10), [53] reports a cost of \(2\tilde{m} + 7\tilde{s} + 21\tilde{a} + 4m + m_{b'}\). We observe that the same formulae can be evaluated at a cost of only \(2\tilde{m} + 7\tilde{s} + 19\tilde{a} + 4m + m_{b'}\) with the precomputation cost of \(3a\) to compute \(\overline{y}_P = -y_P\) and \(x'_P = 3x_P\). Note that all these costs consider the computation of \(X_1 \cdot Y_1\) using the equivalence \(2XY = (X+Y)^2 - X^2 - Y^2\). We remark that, as in Aranha et al. [4], on x64 platforms it is more efficient to compute this term with a direct multiplication, since \(\tilde{m} - \tilde{s} < 3\tilde{a}\). In this scenario, the cost when applying our precomputations is \(3\tilde{m} + 6\tilde{s} + 15\tilde{a} + 4m + m_{b'}\). Finally, further improvements are possible if \(b\) is cleverly selected [80]. For instance, if \(b=2\) then \(b'=2/(1+i)=1-i\), which minimizes the number of additions and subtractions. Computing the simpler doubling/line function now requires \(3\tilde{m} + 6\tilde{s} + 16\tilde{a} + 4m\) with the precomputation cost of \(3a\) (in comparison to the computation proposed in [4, 35, 53], we save \(2\tilde{a}, 3\tilde{a}\) and \(5\tilde{a}\), respectively, when \(\tilde{m} - \tilde{s} < 3\tilde{a}\)):

$$\begin{aligned} \begin{array}{c} A = X_1 \cdot Y_1/2,\,\,\, B = Y_1^2,\,\,\, C = Z_1^2,\,\,\, D = 3C,\,\,\, E_0 = D_0+D_1, \\ E_1 = D_1-D_0,\,\,\, F = 3E,\,\,\, X_3 = A \cdot (B - F),\,\,\, G = (B + F)/2, \\ Y_3 = G^2 - 3E^2,\,\,\, H = \left( Y_1+Z_1 \right) ^2 - (B+C),\,\,\, Z_3 = B \cdot H, \\ g_{2\psi (T)}(P) = H\overline{y}_P + X_1^2x'_Pw + (E-B)w^3. \end{array} \end{aligned}$$
(11)
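As a sanity check of Eq. (11), the sequence can be run in a toy quadratic extension and compared with affine doubling. This is a minimal sketch under stated assumptions: the prime \(p = 19\), the representation of \(\mathbb {F}_{p^{2}}\) as pairs \((a_0, a_1)\) with \(i^2 = -1\), the values chosen for \(x_P, y_P\), and all function names are ours and purely illustrative. Note how the step \(E_0 = D_0 + D_1\), \(E_1 = D_1 - D_0\) implements the multiplication by \(b' = 1 - i\) with two base-field additions.

```python
# Toy check of Eq. (11) over F_{p^2} = F_p[i]/(i^2 + 1).
# Assumptions (not from the text): p = 19 and the twist y^2 = x^3 + b' with
# b' = 1 - i; an element a0 + a1*i is stored as the pair (a0, a1).
P_MOD = 19
HALF = pow(2, -1, P_MOD)
B_TW = (1, P_MOD - 1)  # b' = 1 - i

def add(a, b): return ((a[0] + b[0]) % P_MOD, (a[1] + b[1]) % P_MOD)
def sub(a, b): return ((a[0] - b[0]) % P_MOD, (a[1] - b[1]) % P_MOD)
def scal(k, a): return (k * a[0] % P_MOD, k * a[1] % P_MOD)
def mul(a, b):
    return ((a[0] * b[0] - a[1] * b[1]) % P_MOD,
            (a[0] * b[1] + a[1] * b[0]) % P_MOD)
def inv(a):
    n = pow(a[0] * a[0] + a[1] * a[1], -1, P_MOD)  # inverse of the norm
    return (a[0] * n % P_MOD, -a[1] * n % P_MOD)

def double_homogeneous(T, xP, yP):
    """Operation sequence of Eq. (11); returns 2T and the line coefficients."""
    X1, Y1, Z1 = T
    A = scal(HALF, mul(X1, Y1))
    B = mul(Y1, Y1)
    C = mul(Z1, Z1)
    D = scal(3, C)
    E = ((D[0] + D[1]) % P_MOD, (D[1] - D[0]) % P_MOD)  # E = (1 - i) * D
    F = scal(3, E)
    X3 = mul(A, sub(B, F))
    G = scal(HALF, add(B, F))
    Y3 = sub(mul(G, G), scal(3, mul(E, E)))
    S = add(Y1, Z1)
    H = sub(mul(S, S), add(B, C))  # H = 2*Y1*Z1
    Z3 = mul(B, H)
    # Precomputed values: -y_P and 3*x_P, both in the base field.
    line = (scal(-yP, H), scal(3 * xP, mul(X1, X1)), sub(E, B))
    return (X3, Y3, Z3), line

def double_affine(x, y):
    """Reference: affine doubling on y^2 = x^3 + b'."""
    lam = mul(scal(3, mul(x, x)), inv(scal(2, y)))
    x3 = sub(mul(lam, lam), scal(2, x))
    return x3, sub(mul(lam, sub(x, x3)), y)

def find_point():
    """Brute-force search for an affine point with y != 0."""
    elems = [(a, b) for a in range(P_MOD) for b in range(P_MOD)]
    for x in elems:
        rhs = add(mul(x, mul(x, x)), B_TW)
        for y in elems:
            if y != (0, 0) and mul(y, y) == rhs:
                return x, y
    raise ValueError("no point found")

x, y = find_point()
z = (2, 5)                          # arbitrary nonzero projective scaling
T = (mul(x, z), mul(y, z), z)
xP, yP = 3, 4                       # arbitrary stand-ins for P's coordinates
(X3, Y3, Z3), line = double_homogeneous(T, xP, yP)
z_inv = inv(Z3)
affine_result = (mul(X3, z_inv), mul(Y3, z_inv))

# Reference line coefficients taken directly from Eq. (10).
X1, Y1, Z1 = T
ref_line = (scal(-2 * yP, mul(Y1, Z1)),
            scal(3 * xP, mul(X1, X1)),
            sub(scal(3, mul(B_TW, mul(Z1, Z1))), mul(Y1, Y1)))
```

The comparison of `affine_result` with `double_affine(x, y)` exercises Eq. (9), while `ref_line` exercises Eq. (10).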

Similarly, if \(T = (X_1,Y_1,Z_1)\) and \(Q = (x_2,y_2) \in E'(\mathbb {F}_{p^{2}})\) are points in homogeneous and affine coordinates, respectively, one can compute the point \(T+Q = (X_3,Y_3,Z_3)\) with the following formula:

$$\begin{aligned} \begin{array}{c} X_3 = \lambda (\lambda ^3 + Z_1\theta ^2 - 2X_1\lambda ^2), \\ Y_3 = \theta (3X_1\lambda ^2 - \lambda ^3 - Z_1\theta ^2) - Y_1 \lambda ^3,\,\,\, Z_3 = Z_1\lambda ^3, \end{array} \end{aligned}$$
(12)

where \(\theta = Y_1 - y_2 Z_1\) and \(\lambda = X_1 - x_2 Z_1\). In the case of a \(D\)-type twist, the line evaluated at \(P = (x_P,y_P)\) can be computed with the following formula [53]:

$$\begin{aligned} g_{\psi (T+Q)}(P) = -\lambda y_P - \theta x_Pw + (\theta x_2 - \lambda y_2)w^3. \end{aligned}$$
(13)

Similar to the case of doubling, Eq. (13) is basically the same line evaluation formula presented in [35] plus an efficient selection of the positioning of terms suggested in [53] to obtain a fast sparse multiplication inside the Miller loop. The full mixed addition/line function formulae can be evaluated at a cost of \(11\tilde{m} + 2\tilde{s} + 8\tilde{a} + 4m\) with the precomputation cost of \(2a\) to compute \(\overline{x}_P = -x_P\) and \(\overline{y}_P = -y_P\) [53]:

$$\begin{aligned} \begin{array}{c} A=y_2Z_1,\,\,\, B=x_2Z_1,\,\,\, \theta = Y_1-A,\,\,\, \lambda =X_1-B,\,\,\, C=\theta ^2, \\ D=\lambda ^2,\,\,\, E=\lambda ^3,\,\,\, F=Z_1 C,\,\,\, G=X_1 D,\,\,\, H=E+F-2G, \\ X_3=\lambda H,\,\,\, I = Y_1 E,\,\,\, Y_3=\theta (G-H)-I,\,\,\, Z_3 = Z_1E,\,\,\, J = \theta x_2 - \lambda y_2, \\ g_{\psi (T+Q)}(P) = \lambda \overline{y}_P + \theta \overline{x}_Pw + Jw^3. \end{array} \end{aligned}$$

In the case of an \(M\)-type twist, the line function evaluated at \(\psi (P)=(x_P w^2, y_Pw^3)\) can be computed with the same sequence of operations shown above.
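The mixed addition sequence can be checked in the same spirit. The sketch below is illustrative only: it assumes a toy curve \(y^2 = x^3 + 3\) over \(\mathbb {F}_{101}\) with all coordinates in the prime field, arbitrary stand-in values for \(x_P, y_P\), and function names of our own. It compares the projective result with the affine formula of Eq. (8) and the line coefficients with Eq. (13).

```python
# Toy check of the mixed addition/line sequence against Eqs. (8) and (13).
# Assumptions (not from the text): curve y^2 = x^3 + 3 over F_101, all
# coordinates in the prime field; xP, yP are arbitrary stand-ins for P.
p, b = 101, 3

def inv(x):
    return pow(x % p, -1, p)  # modular inverse (Python 3.8+)

def mixed_add_with_line(T, Q, xP_bar, yP_bar):
    """T homogeneous, Q affine; xP_bar = -x_P and yP_bar = -y_P."""
    X1, Y1, Z1 = T
    x2, y2 = Q
    A = y2 * Z1 % p
    B = x2 * Z1 % p
    theta = (Y1 - A) % p
    lam = (X1 - B) % p
    C = theta * theta % p
    D = lam * lam % p
    E = lam * D % p
    F = Z1 * C % p
    G = X1 * D % p
    H = (E + F - 2 * G) % p
    X3 = lam * H % p
    I = Y1 * E % p
    Y3 = (theta * (G - H) - I) % p
    Z3 = Z1 * E % p
    J = (theta * x2 - lam * y2) % p
    line = (lam * yP_bar % p, theta * xP_bar % p, J)
    return (X3, Y3, Z3), line

def add_affine(T, Q):
    """Reference: affine addition, Eq. (8)."""
    (x1, y1), (x2, y2) = T, Q
    lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x2 - x1) % p
    return x3, (lam * (x1 - x3) - y1) % p

# Pick two affine points with distinct x-coordinates, so that T != +-Q.
points = [(x, y) for x in range(p) for y in range(p)
          if (y * y - x ** 3 - b) % p == 0]
T_aff = points[0]
Q = next(pt for pt in points if pt[0] != T_aff[0])

z = 7                                   # arbitrary projective scaling
T = (T_aff[0] * z % p, T_aff[1] * z % p, z)
xP, yP = 5, 9
(X3, Y3, Z3), line = mixed_add_with_line(T, Q, -xP % p, -yP % p)
result = (X3 * inv(Z3) % p, Y3 * inv(Z3) % p)

# Reference line coefficients taken directly from Eq. (13).
theta = (T[1] - Q[1] * T[2]) % p
lam = (T[0] - Q[0] * T[2]) % p
ref_line = (-lam * yP % p, -theta * xP % p,
            (theta * Q[0] - lam * Q[1]) % p)
```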

4.4 Operation Count

Table 1 presents a detailed operation count for each operation relevant in the computation of a pairing over a BN curve, considering all the improvements described in the previous section. Using these partial numbers, we obtain an operation count for the full pairing computation on a fixed BN curve.

Table 1. Computational cost for arithmetic required by Miller’s Algorithm.

Miller Loop. Sophisticated pairing-based protocols may impose additional restrictions on the parameter choice along with some performance penalty, for example requiring the cofactor of the \(\mathbb {G}_T\) group to be a large prime number [87]. For efficiency and a fair comparison with related works, we adopt the parameters \(\beta \), \(\xi \), \(b = 2\), \(u = -(2^{62} + 2^{55} + 1)\) from [80]. For this set of parameters, the Miller loop in Algorithm 1 and the final line evaluations execute some amount of precomputation for accelerating the curve arithmetic formulas, 64 point doublings with line evaluations and 6 point additions with line evaluations; a single \(p\)-power Frobenius, a single \(p^2\)-power Frobenius and 2 negations in \(E'(\mathbb {F}_{p^{2}})\); and 66 sparse accumulations in the Miller variable, 2 sparser multiplications, 1 multiplication, 1 conjugation and 63 squarings in \(\mathbb {F}_{p^{12}}\). The corresponding costs in affine and homogeneous projective coordinates are, respectively:

$$\begin{aligned} \mathrm{MLA }&= (i + m + a) + 64 \cdot (3\tilde{m} + 2\tilde{s} + 7\tilde{a} + \tilde{\imath } + 4m)\\&+ ~6 \cdot (3\tilde{m} + \tilde{s} + 6\tilde{a} + \tilde{\imath } + 4m) + 2\tilde{m} + 2a + 2m + 2\tilde{a}\\&+ ~66 \cdot (10\tilde{m}_u + 6\tilde{r} + 31\tilde{a}) + 2 \cdot (5\tilde{m}_u + 3\tilde{r} + 13\tilde{a})\\&+ ~3\tilde{a} + (18\tilde{m}_u + 6\tilde{r} + 110\tilde{a}) + 63 \cdot (3\tilde{m}_u + 12\tilde{s}_u + 6\tilde{r} + 93\tilde{a})\\&= 1089\tilde{m}_u + 890\tilde{s}_u + 1132\tilde{r} + 8530\tilde{a} + 70\tilde{\imath } + i + 283m + 3a. \end{aligned}$$
$$\begin{aligned} \mathrm{MLP }&= (4a) + 64 \cdot (3\tilde{m}_u + 6\tilde{s}_u + 8\tilde{r} + 19\tilde{a} + 4m)\\&+ ~6 \cdot (11\tilde{m}_u + 2\tilde{s}_u + 11\tilde{r} + 10\tilde{a} + 4m) + 2\tilde{m} + 2a + 2m + 2\tilde{a}\\&+ ~66 \cdot (13\tilde{m}_u + 6\tilde{r} + 48\tilde{a}) + 2 \cdot (6\tilde{m}_u + 5\tilde{r} + 22\tilde{a})\\&+ ~3\tilde{a} + (18\tilde{m}_u + 6\tilde{r} + 110\tilde{a}) + 63 \cdot (3\tilde{m}_u + 12\tilde{s}_u + 6\tilde{r} + 93\tilde{a})\\&= ~1337\tilde{m}_u + 1152\tilde{s}_u + 1388\tilde{r} + 10462\tilde{a} + 282m + 6a. \end{aligned}$$
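The bookkeeping behind such totals is easy to mechanize. The sketch below tallies the affine count \(\mathrm{MLA}\) only, under the assumption (ours, suggested by the notation of the operation costs) that each reduced operation in \(\mathbb {F}_{p^{2}}\) is counted as its unreduced version plus one reduction, that is, \(\tilde{m} = \tilde{m}_u + \tilde{r}\) and \(\tilde{s} = \tilde{s}_u + \tilde{r}\). The key names (`mt` for \(\tilde{m}\), `mtu` for \(\tilde{m}_u\), `it` for \(\tilde{\imath }\), and so on) are hypothetical labels.

```python
from collections import Counter

def scaled(k, **ops):
    """Cost term k * (ops), e.g. scaled(64, mt=3, st=2)."""
    return Counter({op: k * n for op, n in ops.items()})

# Terms of MLA in the order they appear in the expression above.
terms = [
    scaled(1, i=1, m=1, a=1),                  # precomputation
    scaled(64, mt=3, st=2, at=7, it=1, m=4),   # doublings with lines
    scaled(6, mt=3, st=1, at=6, it=1, m=4),    # additions with lines
    scaled(1, mt=2, a=2, m=2, at=2),           # Frobenius maps and negations
    scaled(66, mtu=10, rt=6, at=31),           # sparse accumulations
    scaled(2, mtu=5, rt=3, at=13),             # sparser multiplications
    scaled(1, at=3),                           # conjugation
    scaled(1, mtu=18, rt=6, at=110),           # full multiplication
    scaled(63, mtu=3, stu=12, rt=6, at=93),    # squarings in F_{p^12}
]
total = sum(terms, Counter())

# Assumed expansion: reduced operation = unreduced operation + one reduction.
mt, st = total.pop("mt"), total.pop("st")
total["mtu"] += mt
total["stu"] += st
total["rt"] += mt + st
mla = dict(total)
```

Under this assumption, `mla` reproduces the coefficients of the final line of \(\mathrm{MLA}\): 1089, 890, 1132 and 8530 for \(\tilde{m}_u, \tilde{s}_u, \tilde{r}, \tilde{a}\), together with \(70\tilde{\imath } + i + 283m + 3a\).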

Final Exponentiation. For computing the final exponentiation, we employ the state-of-the-art approach by [47] in the context of BN curves. As initially proposed by [89], power \(\frac{p^{12} - 1}{r}\) is factored into the easy exponent \((p^6 - 1)(p^2 + 1)\) and the hard exponent \(\frac{p^4 - p^2 + 1}{n}\). The easy power is computed by a short sequence of multiplications, conjugations, fast applications of the Frobenius map [15] and a single inversion in \(\mathbb {F}_{p^{12}}\). The hard power is computed in the cyclotomic subgroup, where additional algebraic structure allows elements to be compressed and squared consecutively in their compressed form, with decompression required only when performing multiplications [4, 74, 88].
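The factorization of the exponent can be verified numerically for the chosen \(u\). The sketch below assumes the standard BN parameterization \(p(u) = 36u^4 + 36u^3 + 24u^2 + 6u + 1\) and \(n(u) = 36u^4 + 36u^3 + 18u^2 + 6u + 1\), writing \(n\) for the group order as in the hard exponent; the variable names are ours. It only checks the integer arithmetic on the exponents, not the exponentiation in \(\mathbb {F}_{p^{12}}\) itself.

```python
# Exponent arithmetic of the final exponentiation for a BN curve.
# Assumption: standard BN polynomials p(u) and n(u); u taken from the text.
u = -(2**62 + 2**55 + 1)
p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
n = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1

# Easy part: (p^6 - 1)(p^2 + 1); hard part: (p^4 - p^2 + 1)/n.
easy = (p**6 - 1) * (p**2 + 1)
hard, remainder = divmod(p**4 - p**2 + 1, n)

# The two parts multiply back to the full exponent (p^12 - 1)/n.
full_exponent = (p**12 - 1) // n
splits_correctly = (remainder == 0) and (easy * hard == full_exponent)
```

The split works because \((p^2 + 1)(p^4 - p^2 + 1) = p^6 + 1\), so that the three factors multiply to \(p^{12} - 1\), and because \(n\) divides the cyclotomic factor \(p^4 - p^2 + 1\) for embedding degree 12.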

Moreover, lattice reduction is able to obtain parameterized multiples of the hard exponent and significantly reduce the length of the addition chain involved in that exponentiation [47]. In total, the hard part of the final exponentiation requires 3 exponentiations by parameter \(u\), 3 squarings in the cyclotomic subgroup, 10 full extension field multiplications and 3 applications of the Frobenius maps with increasing \(p\)th-powers. We refer to [4] for the cost of an exponentiation by our choice of \(u\) and compute the exact operation count of the final exponentiation:

$$\begin{aligned} \mathrm{FE }&= (23\tilde{m}_u + 11\tilde{s}_u + 16\tilde{r} + 129\tilde{a} + \tilde{\imath }) + 3\tilde{a} + 12 \cdot (18\tilde{m}_u + 6\tilde{r} + 110\tilde{a})\\&+ ~3 \cdot (45\tilde{m}_u + 378\tilde{s}_u + 275\tilde{r} + 2164\tilde{a} + \tilde{\imath }) + 3 \cdot (9\tilde{s}_u + 6\tilde{r} + 46\tilde{a})\\&+ ~(5\tilde{m} + 6a) + 2 \cdot (10m + 2\tilde{a}) + (5\tilde{m} + 2\tilde{a} + 6a)\\&= ~384\tilde{m}_u + 1172\tilde{s}_u + 941\tilde{r} + 8085\tilde{a} + 4\tilde{\imath } + 20m + 12a. \end{aligned}$$

4.5 Results and Discussion

The combined cost for a pairing computation in homogeneous projective coordinates can then be expressed as:

$$\begin{aligned} \mathrm{MLP }+\mathrm{FE }&= 1721\tilde{m}_u + 2324\tilde{s}_u + 2329\tilde{r} + 18547\tilde{a} + 4\tilde{\imath } + i + 302m + 18a\\&= 9811m_u + 4658r + 57384a + 4\tilde{\imath } + i + 302m + 18a\\&= 10113m_u + 4960r + 57852a + 4\tilde{\imath } + i. \end{aligned}$$

A direct comparison with a previous record-setting implementation [4], considering only the number of multiplications in \(\mathbb {F}_p\) generated by arithmetic in \(\mathbb {F}_{p^{2}}\) as the performance metric, shows that our updated implementation in projective coordinates saves 3.4 % of the base field multiplications. This reflects the faster final exponentiation adopted from [47] and the more efficient formulas for inversion and squaring in \(\mathbb {F}_{p^{12}}\). These formulas were not the most efficient in [4] due to a higher number of additions, but this additional cost is now offset by improved addition handling and faster division by 2. Comparing the total number of multiplications with more recent implementations [69, 95], our updated implementation saves 1.9 %, or 198 multiplications.

The pairing code was implemented in the C programming language, with the performance-critical code implemented in assembly. The compiler used was GCC version 4.7.0, with switches turned on for loop unrolling, inlining of small functions to reduce function call overhead and optimization level \(\mathtt {-O3}\). Performance experiments were executed on a broad set of 64-bit Intel-compatible platforms: older Nehalem Core i5 540M 2.53 GHz and AMD Phenom II 3.0 GHz processors, and modern Sandy Bridge Xeon E31270 3.4 GHz and Ivy Bridge Core i5 3570 3.4 GHz processors, including a recent Haswell Core i7 4750 HQ 2.0 GHz processor. All machines had automatic overclocking capabilities disabled to reduce randomness in the results. Table 2 presents the timings split into the Miller loop and the final exponentiation. This is useful not only for more fine-grained comparisons, but also for more accurate estimates of the latency of multi-pairings or precomputed pairings. The complete implementation will be made available in the next release of the RELIC toolkit [3].

Table 2. Comparison between implementations based on affine and projective coordinates on 64-bit architectures. Timings are presented in \(10^3\) clock cycles and were collected as the average of \(10^4\) repetitions of the same operation. Target platforms are AMD Phenom II (P II) and Intel Nehalem (N), Sandy Bridge (SB), Ivy Bridge (IB), Haswell (H) with or without support for the mulx instruction.

We obtain several performance improvements in comparison with current literature. Our implementation based on projective coordinates improves results from [4] by 6 % and 9 % on the Nehalem and Phenom II machines, respectively. Compared to an updated version [95] of a previous record-setting implementation [15], our Sandy Bridge timings are faster by 82,000 cycles, or 5 %. When independently benchmarking their available software on the Ivy Bridge machine, we observe a latency of 1,403 K cycles, thus an improvement by our software of 5 %. Considering the Haswell results from the same software available at [69], we obtain a speedup of 8 % without taking into account the mulx instruction and comparable performance when mulx is employed. It is also interesting to note that the use of mulx yields a relatively small speedup of 3 %. When exploiting such an instruction, the lack of carry-preserving addition instructions in the first generation of Haswell processors makes an efficient implementation of Comba-based multiplication and Montgomery reduction difficult, favoring the use of the typically slower Schoolbook versions. We anticipate better support for Comba variants with the upcoming addition instructions [78].

In the implementation based on affine coordinates, the state-of-the-art result at the 128-bit security level is the one described by Acar et al. [1]. Unfortunately, only the latency of 15.6 million cycles on a Core 2 Duo is provided for 64-bit Intel architectures. While this does not allow a direct comparison, the small performance improvement between the Core 2 Duo and Nehalem reported in [4] suggests that our affine implementation should be around 6 times faster than [1] when executed on the same machine.

Despite being slower than our own projective version, our affine implementation is still considerably faster than some previous speed records on projective coordinates [15, 54, 75]. This hints at the possibility that affine pairings could be improved even further, contrary to the naive intuition that the affine representation is exceedingly worse than a projective approach.

5 Conclusion

Pairings are amazingly flexible tools that enable the design of innovative cryptographic protocols. Their complex implementation has been the focus of intense research since the beginning of the millennium in what became a formidable race to make it efficient and practical.

We have reviewed the theory behind pairings, covered state-of-the-art algorithms, presented some further optimizations to the pairing computation in affine and projective coordinates, and analyzed the performance of the most efficient algorithmic options for pairing computation over ordinary curves at the 128-bit security level. In particular, our implementations of affine and projective pairings using Barreto-Naehrig curves show that the efficiency of these two approaches is not as contrasting as it might seem, and hint that further optimizations might be possible. Remarkably, the combination of advances in processor technology and carefully crafted algorithms brings the computation of pairings close to the one million cycle mark.