
FPGA Based High Performance Double-Precision Matrix Multiplication

  • Vinay B. Y. Kumar
  • Siddharth Joshi
  • Sachin B. Patkar
  • H. Narayanan
Article

Abstract

We present two designs (I and II) for IEEE 754 double-precision floating-point matrix multiplication, optimized for implementation on high-end FPGAs. Matrix multiplication forms the kernel of many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. Both designs are based on the rank-1 update scheme, can handle arbitrary matrix sizes, and sustain their peak performance except during an initial latency period. Through these designs, we demonstrate the trade-offs between local memory and bandwidth in an FPGA implementation and present an analysis for the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design II and 5.9 GB/s for design I. This compares favourably with both related art and general-purpose CPU implementations.
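As a rough software illustration of the rank-1 update scheme named in the abstract, the sketch below forms C = A·B as a sum of outer products, one per column of A and matching row of B. It is not the authors' hardware design: the function names `matmul_rank1` and `peak_gflops` are hypothetical, and the performance helper assumes each PE completes one multiply and one add per cycle (2 FLOPs), an assumption that is consistent with, but not stated in, the abstract's 29.8 GFLOPS figure.

```python
def matmul_rank1(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    A is m x k and B is k x n, both given as lists of rows.
    """
    m, k = len(A), len(A[0])
    if len(B) != k:
        raise ValueError("inner dimensions of A and B must match")
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):
        # One rank-1 update: C += (column p of A) * (row p of B).
        col = [A[i][p] for i in range(m)]
        row = B[p]
        for i in range(m):
            a = col[i]
            for j in range(n):
                C[i][j] += a * row[j]
    return C


def peak_gflops(num_pes, freq_mhz, flops_per_pe_per_cycle=2):
    """Peak throughput in GFLOPS.

    Assumption (not from the source): each PE performs one multiply
    and one add per clock cycle, i.e. 2 FLOPs per cycle.
    """
    return num_pes * freq_mhz * 1e6 * flops_per_pe_per_cycle / 1e9


if __name__ == "__main__":
    print(matmul_rank1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(peak_gflops(40, 373))
```

Under the stated 2-FLOPs-per-cycle assumption, 40 PEs at 373 MHz give about 29.84 GFLOPS, which matches the sustained figure quoted above. In the outer-product ordering, each element of a column of A is reused across a whole row of B, which is the kind of data reuse that makes the scheme attractive when bandwidth and local memory are being traded against each other.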

Keywords

  • High performance computing
  • Matrix multiplication
  • Rank-1 scheme
  • FPGA implementation
  • Memory-bandwidth trade-off
  • Scalability



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Vinay B. Y. Kumar (1)
  • Siddharth Joshi (1)
  • Sachin B. Patkar (1)
  • H. Narayanan (1)

  1. Department of Electrical Engineering, Indian Institute of Technology, Bombay, Mumbai, India
