Abstract
The general-purpose processor (GPP) is an important platform for the fast Fourier transform (FFT) because of its flexibility, reliability, and practicality. Because FFT is a representative application that is intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP and also guides architecture optimization for GPPs. Based on the model, two theorems on the optimization of architecture parameters are deduced, which give lower bounds on the number of registers and on the memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model.
The above investigations were adopted in the development of Godson-3B, an industrial GPP. The optimization techniques deduced from our performance model improve FFT performance by about 40% while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 W of power consumption and, as far as we know, has the highest performance-per-watt in complex FFT among processors. This work could benefit the optimization of other GPPs as well.
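To put the reported figures in more familiar units, the short sketch below converts the stated runtime (0.368 μs for a 1024-point complex FFT) and power (about 40 W) into a nominal throughput and a performance-per-watt value. It assumes the conventional flop count of 5·N·log2(N) for a complex FFT of size N, which is a common benchmarking convention and not necessarily the metric used in the paper; the function and variable names are illustrative only.

```python
import math


def fft_flops(n: int) -> float:
    """Nominal flop count for a complex FFT of size n.

    Assumption: the conventional 5 * n * log2(n) estimate, which may
    differ from the operation count used by the paper's model.
    """
    return 5.0 * n * math.log2(n)


def gflops(n: int, seconds: float) -> float:
    """Nominal throughput in GFLOPS for one size-n FFT taking `seconds`."""
    return fft_flops(n) / seconds / 1e9


# Figures quoted in the abstract.
n_points = 1024
runtime_s = 0.368e-6   # 0.368 microseconds
power_w = 40.0         # "about 40 W", treated here as exactly 40

rate = gflops(n_points, runtime_s)
per_watt = rate / power_w

print(f"nominal throughput: {rate:.2f} GFLOPS")
print(f"nominal efficiency: {per_watt:.3f} GFLOPS/W")
```

Under these assumptions the figures work out to roughly 139 GFLOPS and about 3.5 GFLOPS/W; both numbers inherit the approximation in the quoted power value.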
Additional information
This work is partially supported by the National Science and Technology Major Project under Grant Nos. 2009ZX01028-002-003, 2009ZX01029-001-003, 2010ZX01036-001-002, and the National Natural Science Foundation of China under Grant Nos. 61050002, 61003064, 60921002.
About this article
Cite this article
Li, L., Chen, YJ., Liu, DF. et al. An FFT Performance Model for Optimizing General-Purpose Processor Architecture. J. Comput. Sci. Technol. 26, 875–889 (2011). https://doi.org/10.1007/s11390-011-0186-z
DOI: https://doi.org/10.1007/s11390-011-0186-z