An FFT Performance Model for Optimizing General-Purpose Processor Architecture
The general-purpose processor (GPP) is an important platform for the fast Fourier transform (FFT) because of its flexibility, reliability, and practicality. Since FFT is a representative application that is intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP and also guides architecture optimization for GPPs. Based on the model, two theorems on the optimization of architecture parameters are deduced, giving lower bounds on the number of registers and on the memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model.
These results were applied in the development of Godson-3B, an industrial GPP. The optimization techniques deduced from our performance model improve FFT performance by about 40% while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs at about 40 W of power consumption and, as far as we know, has the highest performance-per-watt in complex FFT among processors. This work could benefit the optimization of other GPPs as well.
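To put the reported figures in perspective, the following minimal sketch converts the runtime and power quoted above into an effective throughput and a performance-per-watt value. It assumes the figures read as 0.368 μs and about 40 W, and it uses the common benchmarking convention of 5·N·log2(N) floating-point operations for a complex FFT of size N. That convention is an assumption made here for illustration, not a count taken from the paper, and the function names are hypothetical.

```python
import math


def nominal_fft_flops(n: int) -> float:
    """Nominal flop count 5*N*log2(N) for a complex FFT of size N.

    This is a widely used benchmarking convention (assumed here),
    not an exact operation count for any particular implementation.
    """
    return 5.0 * n * math.log2(n)


def effective_gflops(n: int, seconds: float) -> float:
    """Effective throughput in GFLOPS for one size-N FFT taking `seconds`."""
    return nominal_fft_flops(n) / seconds / 1e9


# Figures quoted in the abstract (runtime and approximate power).
n = 1024
runtime_s = 0.368e-6
power_w = 40.0

gflops = effective_gflops(n, runtime_s)
gflops_per_watt = gflops / power_w

print(f"nominal flops:      {nominal_fft_flops(n):.0f}")
print(f"effective GFLOPS:   {gflops:.2f}")
print(f"GFLOPS per watt:    {gflops_per_watt:.2f}")
```

Under these assumptions the quoted runtime corresponds to roughly 139 GFLOPS, or about 3.5 GFLOPS per watt. The result is only as accurate as the nominal flop convention and the approximate 40 W figure.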
Keywords: fast Fourier transform (FFT), general-purpose processor (GPP), performance prediction model, vector unit, DMA