A Parallel 1-D FFT Implementation Method for Multi-core Vector Processors

  • Zhong Liu
  • Xi Tian
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 994)


This paper presents an efficient parallel 1-D FFT implementation method based on the architectural features of multi-core vector processors. According to the number of data points M that the global cache (GC) can hold, the method divides the parallel computation of a large-point 1-D FFT into an (n-m)-level parallel FFT computation and an M-point parallel FFT computation. In the (n-m)-level FFT computation, each stage is computed in parallel using a shared DDR data method. For the M-point parallel FFT computation, a method based on the matrix Fourier algorithm is designed: it converts the original M-point 1-D FFT computation into a 2-D FFT computation and parallelizes it using a shared GC data method, which avoids multiple data transfers between the GC and the AM and reduces data transmission overhead. The column FFT computation is merged with the multiplication of the column FFT results by the factor matrix in the AM, which further reduces the number of data transfers between the AM and the GC and can significantly improve the efficiency of the M-point FFT computation. Experimental results on Matrix show that, compared with the TMS320C6678 at the same frequency, the average speedup of the single-core single-precision 1-D FFT is 8.26 times and the average speedup of the dual-core single-precision 1-D FFT is 6.78 times.
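The conversion of an M-point 1-D FFT into a 2-D FFT can be illustrated with a minimal sketch of the generic matrix Fourier (four-step) decomposition. It is not the authors' vectorized implementation: the function names `dft` and `matrix_fourier_fft`, the row-major layout, and the naive O(n²) kernel are assumptions made for illustration only, and the GC, AM, and multi-core scheduling are not modeled. The twiddle-factor multiplication is applied in the same loop as the column transform, loosely mirroring the merging of the column FFT with the factor matrix multiplication described in the abstract.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; stands in for an optimized small-FFT kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def matrix_fourier_fft(x, rows, cols):
    """M-point DFT (M = rows * cols) via the matrix Fourier decomposition.

    The input is viewed as a rows x cols matrix in row-major order,
    i.e. element (r, c) is x[r * cols + c] (an illustrative layout choice).
    """
    m = rows * cols
    if len(x) != m:
        raise ValueError("len(x) must equal rows * cols")

    # Column transforms fused with the twiddle-factor multiplication:
    # b[k1][c] = W_M^(c * k1) * sum_r a[r][c] * W_rows^(r * k1)
    b = [[0j] * cols for _ in range(rows)]
    for c in range(cols):
        column = dft([x[r * cols + c] for r in range(rows)])
        for k1 in range(rows):
            twiddle = cmath.exp(-2j * cmath.pi * c * k1 / m)
            b[k1][c] = column[k1] * twiddle

    # Row transforms, written back in transposed order:
    # X[k1 + rows * k2] = sum_c b[k1][c] * W_cols^(c * k2)
    out = [0j] * m
    for k1 in range(rows):
        row = dft(b[k1])
        for k2 in range(cols):
            out[k1 + rows * k2] = row[k2]
    return out


if __name__ == "__main__":
    signal = [complex(i, -0.5 * i) for i in range(12)]
    direct = dft(signal)
    decomposed = matrix_fourier_fft(signal, rows=3, cols=4)
    print(max(abs(a - b) for a, b in zip(direct, decomposed)))
```

The column transforms are independent of one another, as are the row transforms, which is the property a multi-core implementation can exploit. In this sketch they simply run in sequential loops.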


Multi-core vector processors; Large-point 1-D Fast Fourier Transform; Matrix Fourier algorithm; Parallel



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. College of Computer, National University of Defense Technology, Changsha, China
