Implementation and Optimization of Multi-dimensional Real FFT on ARMv8 Platform

  • Xiao Wang
  • Haipeng Jia
  • Zhihao Li
  • Yunquan Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11335)


The Fourier transform is one of the most important algorithms and is applied in a wide range of fields, such as signal processing and data compression. In real-world applications such as image compression (JPEG), the input to the Fourier transform is typically real-valued. In this paper, such transforms are called real DFTs (real discrete Fourier transforms), and optimizing them for specific platforms is critical. We implement 1D and 2D real DFTs on the ARMv8 platform, the flagship architecture of ARM. The real DFT kinds implemented and optimized include R2HC, HC2R, DHT, DCT-I to DCT-IV, and DST-I to DST-IV, with particular optimization for input sizes of the form \(2^{q}3^{n}5^{m}\). To achieve high performance, optimization is carried out in the following aspects: (1) reduction of the computational complexity of the real DFT; (2) implementation of a high-performance 1D complex DFT algorithm to support the real DFT; (3) for the 2D real DFT, a cache-aware blocking approach to improve cache performance. Experimental results show that, compared with FFTW 3.3.7, 1D-Float DFT gains a 1.52x speedup on average across all real DFT kinds, with a maximum speedup of 1.79x; 1D-Double DFT gains a 1.34x speedup on average across all real DFT kinds, with a maximum speedup of 1.61x; 2D-Float DFT gains a 1.41x speedup on average across all real DFT kinds, with a maximum speedup of 1.70x; and 2D-Double DFT gains a 1.10x speedup across all real DFT kinds, with a maximum speedup of 1.25x.
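
As an illustration of aspects (1) and (2), the following is a minimal sketch of one textbook way to compute a real DFT of even length N with a single complex DFT of length N/2, using the Hermitian symmetry of real input. It is not taken from the paper and may differ from the authors' implementation; the function names `complex_dft` and `real_dft_via_half_length` are hypothetical, and the naive O(M^2) complex DFT is only a stand-in for an optimized complex FFT kernel.

```python
import cmath


def complex_dft(z):
    """Naive O(M^2) complex DFT; a stand-in for an optimized complex FFT."""
    m = len(z)
    return [
        sum(z[n] * cmath.exp(-2j * cmath.pi * k * n / m) for n in range(m))
        for k in range(m)
    ]


def real_dft_via_half_length(x):
    """Return X[0..N/2] of the DFT of real x (even N) via one N/2-point complex DFT."""
    n = len(x)
    if n == 0 or n % 2 != 0:
        raise ValueError("this sketch assumes a non-empty, even-length input")
    m = n // 2

    # Pack even-indexed samples into the real part, odd-indexed into the imaginary part.
    z = [complex(x[2 * j], x[2 * j + 1]) for j in range(m)]
    zf = complex_dft(z)

    out = []
    for k in range(m):
        zk = zf[k]
        zc = zf[(m - k) % m].conjugate()
        even = (zk + zc) / 2    # DFT of the even-indexed samples
        odd = (zk - zc) / 2j    # DFT of the odd-indexed samples
        twiddle = cmath.exp(-2j * cmath.pi * k / n)
        out.append(even + twiddle * odd)

    # Nyquist bin: X[N/2] = E[0] - O[0], which is purely real for real input.
    out.append(complex(zf[0].real - zf[0].imag, 0.0))
    return out
```

Only the first N/2 + 1 outputs are produced because the remaining bins of a real-input DFT are complex conjugates of these, which is the redundancy that half-complex formats such as R2HC exploit.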


Real Fast Fourier Transform, Program optimization, ARMv8



This work is supported by the National Key Research and Development Program of China under Grant No. 2017YFB0202105 and No. 2016YFE0100300; the National Natural Science Foundation of China under Grant No. 61432018, No. 61521092 and No. 61502405; and the Key Technology Research and Development Programs of Guangdong Province under Grant No. 2015B010108006.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Xiao Wang (1, 2)
  • Haipeng Jia (1)
  • Zhihao Li (1, 2)
  • Yunquan Zhang (1)

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China