Accelerating Number Theoretic Transform in GPU Platform for qTESLA Scheme

  • Wai-Kong Lee
  • Sedat Akleylek
  • Wun-She Yap
  • Bok-Min Goi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11879)


Post-quantum cryptography has attracted considerable attention in recent years because of the threat that quantum computers pose to traditional public key cryptography. Among the post-quantum candidates, lattice-based cryptography is considered the most promising and the most thoroughly studied. The most time-consuming operation in lattice-based schemes is polynomial multiplication. With careful selection of the lattice parameters, polynomial multiplication can be accelerated by the Number Theoretic Transform (NTT) and by massively parallel architectures such as Graphics Processing Units (GPUs). However, existing GPU implementations of the NTT parallelize only one of its three nested for loops, which leads to slow performance and warp divergence. In this paper, we propose a strategy that mitigates this problem and avoids warp divergence. To verify the effectiveness of the proposed strategy, the NTT was implemented with the lattice parameters of qTESLA, one of the round 2 candidates in the NIST Post-Quantum Standardization competition. To the best of our knowledge, this is the first GPU implementation of the NTT with qTESLA parameters. The proposed implementation can be used to accelerate qTESLA signature generation and verification in batches, which is particularly useful in server environments. In addition, the proposed GPU implementation can be generalized to other lattice-based schemes.
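
The three nested loops mentioned above can be illustrated with a minimal sequential sketch of an iterative Cooley-Tukey NTT, followed by an NTT-based polynomial multiplication. This is not the implementation proposed in the paper and it does not use the qTESLA parameters: the toy values n = 8, q = 17 and omega = 2 are assumptions chosen only so that the arithmetic is easy to check, and the function names are hypothetical. The product is also computed modulo x^n - 1 (cyclic convolution) for simplicity, whereas lattice-based schemes typically work modulo x^n + 1.

```python
# Toy parameters (assumptions for illustration, not qTESLA values):
# q is prime, n is a power of two dividing q - 1, and OMEGA has order n mod q.
N, Q, OMEGA = 8, 17, 2


def bit_reverse_permute(a):
    """Return a copy of a with its entries reordered by bit-reversed index."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        r = int(format(i, f"0{bits}b")[::-1], 2)
        out[r] = a[i]
    return out


def ntt(a, omega=OMEGA, q=Q):
    """Iterative Cooley-Tukey NTT; note the three nested loops."""
    n = len(a)
    a = bit_reverse_permute(a)
    length = 2
    while length <= n:                          # loop 1: log2(n) stages
        w_len = pow(omega, n // length, q)      # twiddle step for this stage
        for start in range(0, n, length):       # loop 2: blocks in a stage
            w = 1
            for j in range(length // 2):        # loop 3: butterflies in a block
                u = a[start + j]
                v = a[start + j + length // 2] * w % q
                a[start + j] = (u + v) % q
                a[start + j + length // 2] = (u - v) % q
                w = w * w_len % q
        length *= 2
    return a


def intt(a, omega=OMEGA, q=Q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    n = len(a)
    omega_inv = pow(omega, q - 2, q)            # Fermat inverse, q is prime
    n_inv = pow(n, q - 2, q)
    return [x * n_inv % q for x in ntt(a, omega_inv, q)]


def poly_mul_ntt(a, b, q=Q):
    """Multiply a and b modulo (x^n - 1, q) via pointwise products."""
    fa, fb = ntt(a), ntt(b)
    return intt([x * y % q for x, y in zip(fa, fb)])


def poly_mul_schoolbook(a, b, q=Q):
    """Reference O(n^2) cyclic convolution, used only as a cross-check."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c


a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 8, 0, 0, 0, 0]
product = poly_mul_ntt(a, b)
```

In this sketch, all butterflies within one stage are independent of each other, while the stages must run in order. A GPU version could therefore map loops 2 and 3 to threads; how the paper distributes that work to avoid warp divergence is not described in this abstract.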


Number Theoretic Transform, Lattice-based cryptography, Graphics Processing Units, Post-quantum cryptography



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Universiti Tunku Abdul Rahman, Jalan Universiti, Kampar, Malaysia
  2. Department of Computer Engineering, Ondokuz Mayıs University, Samsun, Turkey
  3. Universiti Tunku Abdul Rahman, Kajang, Malaysia
