An FFT Performance Model for Optimizing General-Purpose Processor Architecture

  • Ling Li
  • Yun-Ji Chen
  • Dao-Fu Liu
  • Cheng Qian
  • Wei-Wu Hu


The general-purpose processor (GPP) is an important platform for the fast Fourier transform (FFT) because of its flexibility, reliability, and practicality. Because FFT is a representative application that is intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits the performance of many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP and also guides the architecture optimization of the GPP. Based on the model, two theorems on the optimization of architecture parameters are deduced, which give lower bounds on the number of registers and on the memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model.
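The abstract does not spell out the model itself, so the following is only a generic roofline-style sketch of what a runtime lower bound for a workload that is intensive in both computation and memory access can look like. It is not the paper's model. The function name, the nominal operation count of 5 N log2 N, and the assumption that every point is read once and written once are all illustrative assumptions.

```python
import math


def fft_runtime_lower_bound(n, peak_flops, mem_bandwidth, bytes_per_point=8):
    """Generic roofline-style bound in seconds (hypothetical helper, not the paper's model).

    n               -- number of complex points (assumed to be a power of two)
    peak_flops      -- peak floating-point rate in operations per second
    mem_bandwidth   -- sustained memory bandwidth in bytes per second
    bytes_per_point -- 8 for a single-precision complex value (2 x 4 bytes)
    """
    # Assumption: nominal radix-2 operation count of 5 * N * log2(N).
    flops = 5.0 * n * math.log2(n)
    # Assumption: each point is read once and written once.
    traffic = 2.0 * n * bytes_per_point
    compute_time = flops / peak_flops
    memory_time = traffic / mem_bandwidth
    # The run cannot finish before the slower of the two resources does.
    return max(compute_time, memory_time)


# Made-up machine parameters, chosen only to show both regimes.
compute_bound = fft_runtime_lower_bound(1024, peak_flops=1e9, mem_bandwidth=1e9)
memory_bound = fft_runtime_lower_bound(1024, peak_flops=1e9, mem_bandwidth=1e8)
```

With these made-up parameters the first call is limited by computation and the second by memory traffic, which is the kind of trade-off that bounds on register count and memory bandwidth are meant to balance.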

The above investigations were adopted in the development of Godson-3B, an industrial GPP. The optimization techniques derived from our performance model improve FFT performance by about 40% while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 W of power consumption and, as far as we know, has the highest performance-per-watt in complex FFT among processors. This work could benefit the optimization of other GPPs as well.
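As a rough worked example, the reported time and power can be turned into a nominal throughput figure. The sketch below assumes the common convention of counting 5 N log2 N floating-point operations for an N-point complex FFT; the abstract does not state which operation count the authors use, so the resulting numbers are illustrative only, and the helper name is hypothetical.

```python
import math


def nominal_fft_gflops(n, seconds):
    """Nominal GFLOPS under the assumed 5 * N * log2(N) operation count."""
    return 5.0 * n * math.log2(n) / seconds / 1e9


# Figures quoted in the abstract: 1024 points, 0.368 microseconds, about 40 W.
gflops = nominal_fft_gflops(1024, 0.368e-6)
gflops_per_watt = gflops / 40.0

print(f"{gflops:.1f} GFLOPS, {gflops_per_watt:.2f} GFLOPS/W")
```

Under that assumption, the quoted figures correspond to roughly 139 nominal GFLOPS, or about 3.5 nominal GFLOPS per watt.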


fast Fourier transform (FFT), general-purpose processor (GPP), performance prediction model, vector unit, DMA

Supplementary material

11390_2011_186_MOESM1_ESM.pdf (123 kb)



Copyright information

© Springer Science+Business Media, LLC & Science Press, China 2011

Authors and Affiliations

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. Loongson Technologies Corporation Limited, Beijing, China
