Accelerating the 3-D FFT Using a Heterogeneous FPGA Architecture

  • Matthew Anderson
  • Maciej Brodowicz
  • Martin Swany
  • Thomas Sterling
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

Future Exascale architectures will likely make extensive use of computing accelerators such as Field Programmable Gate Arrays (FPGAs) because these accelerators are very power efficient. These FPGAs are often located at the network interface card (NIC) and switch level, where they accelerate network operations, incorporate contention-avoiding routing schemes, and perform computations directly on the NIC, bypassing the arithmetic logic unit (ALU) of the CPU. This work explores such a heterogeneous FPGA architecture in the context of two kernels that drive applications on leadership machines: the 3-D Fast Fourier Transform (3-D FFT) and Asynchronous Multi-Tasking (AMT). The machine explored here is a DataVortex system, which consists of conventional processors with programmable logic incorporated in the memory architecture. The programmable logic controls the network, is incorporated in both the network interface cards and the network switches, and implements contention-avoiding network routing. Both the 3-D FFT and AMT kernels show compelling performance for deployment to FFT-driven applications in molecular dynamics and density functional theory.

Keywords

FFT, FPGA, Heterogeneous systems, Asynchronous multitasking, High radix networks, Contention avoiding routing

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Matthew Anderson (1)
  • Maciej Brodowicz (1)
  • Martin Swany (1)
  • Thomas Sterling (1)

  1. School of Informatics and Computing, Center for Research in Extreme Scale Technologies, Indiana University, Bloomington, USA
